Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomic contigs

Vclust

Supported architectures and systems: x86-64, ARM, Apple M; Linux, macOS.

Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenome-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including the International Committee on Taxonomy of Viruses (ICTV) and Minimum Information about an Uncultivated Virus Genome (MIUViG).

Features

Accurate ANI calculations

Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI detects matched and mismatched nucleotides with high sensitivity, which makes the resulting ANI values accurate. Its efficiency comes from a simplified indel handling model, making LZ-ANI orders of magnitude faster than other alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.

Multiple similarity measures

Vclust offers multiple similarity measures between two genome sequences:

  • ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
  • Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
  • Total ANI (tANI): The number of identical nucleotides in the query-to-reference and reference-to-query alignments divided by the combined length of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
  • Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
  • Number of local alignments: The number of local alignments between the two genome sequences.
  • Ratio between genome lengths: The length of the shorter genome divided by the longer one.
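
The definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Vclust's or LZ-ANI's implementation: the `LocalAlignment` record, its field names, and the `similarity_measures` function are hypothetical, and the sketch assumes that the alignment length equals the number of aligned query bases.

```python
from dataclasses import dataclass


@dataclass
class LocalAlignment:
    """One local alignment. Field names are illustrative, not Vclust output columns."""
    aligned_len: int  # length of the local alignment
    identical: int    # identical nucleotides within the alignment


def similarity_measures(q_to_r, r_to_q, query_len, ref_len):
    """Compute the measures listed above from the query's point of view.

    q_to_r: local alignments of the query against the reference
    r_to_q: local alignments of the reference against the query
    """
    ident_q = sum(a.identical for a in q_to_r)
    aln_q = sum(a.aligned_len for a in q_to_r)
    ident_r = sum(a.identical for a in r_to_q)
    return {
        # identical nucleotides / total length of the alignments
        "ani": ident_q / aln_q if aln_q else 0.0,
        # identical nucleotides / length of the query genome
        "gani": ident_q / query_len,
        # identical nucleotides in both directions / combined genome length
        "tani": (ident_q + ident_r) / (query_len + ref_len),
        # assumption: aligned_len counts aligned query bases
        "coverage": aln_q / query_len,
        "num_alns": len(q_to_r),
        # shorter genome / longer genome
        "len_ratio": min(query_len, ref_len) / max(query_len, ref_len),
    }


measures = similarity_measures(
    q_to_r=[LocalAlignment(400, 380), LocalAlignment(200, 190)],
    r_to_q=[LocalAlignment(600, 570)],
    query_len=1000,
    ref_len=2000,
)
```

Under that assumption, gANI equals ANI multiplied by coverage, which is why a pair can show a high ANI and still have a low gANI when only a small part of the genome aligns.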

Multiple clustering algorithms

Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.

  • Single-linkage
  • Complete-linkage
  • UCLUST
  • CD-HIT (Greedy incremental)
  • Greedy set cover (adopted from MMseqs2)
  • Leiden algorithm [optional]
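
As an illustration of the simplest option in the list, single-linkage clustering at a fixed threshold amounts to finding connected components in a graph whose edges are genome pairs with similarity at or above the threshold. The sketch below uses a small union-find structure; the function name and the input layout are hypothetical, and this is not the code Vclust runs.

```python
def single_linkage(ids, pairs, threshold):
    """Group genomes into connected components of the 'similar enough' graph.

    ids: iterable of genome identifiers
    pairs: iterable of (id_a, id_b, similarity) tuples
    threshold: minimum similarity for two genomes to be linked
    """
    parent = {g: g for g in ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for g in ids:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


clusters = single_linkage(
    ids=["A", "B", "C", "D"],
    pairs=[("A", "B", 0.97), ("B", "C", 0.96), ("C", "D", 0.80)],
    threshold=0.95,
)
```

In this toy input, A and C end up in the same cluster through B even though no A-C pair is listed. This chaining is characteristic of single-linkage, whereas complete-linkage requires every pair within a cluster to meet the threshold.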

Speed and efficiency

Vclust uses three efficient C++ tools (Kmer-db, LZ-ANI, and Clusty) for prefiltering, aligning, calculating ANI, and clustering viral genomes. Together, they can process millions of virus genomes within a few hours on a mid-range workstation.

Web service

A web service for datasets of up to 1000 viral genomes is available at http://www.vclust.org.

Quick start

# Clone repository and build Vclust
git clone --recurse-submodules https://github.com/refresh-bio/vclust
cd vclust && make -j

# Prefilter similar genome sequence pairs before conducting pairwise alignments.
./vclust.py prefilter -i example/multifasta.fna -o fltr.txt

# Align similar genome sequence pairs and calculate pairwise ANI measures.
./vclust.py align -i example/multifasta.fna -o ani.tsv --filter fltr.txt

# Cluster genome sequences based on given ANI measure and minimum threshold.
./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
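
To script the three steps, one option is a small Python wrapper that assembles the same command lines and runs them in order. This is a minimal sketch: `quick_start_commands` and `run_quick_start` are hypothetical helpers, only the subcommands and flags shown above are used, and the sketch assumes (as the quick start implies) that `ani.ids.tsv` is written next to `ani.tsv` by the align step.

```python
import subprocess
from pathlib import Path


def quick_start_commands(fasta, workdir=".", ani_threshold=0.95, vclust="./vclust.py"):
    """Build the prefilter, align, and cluster commands from the quick start."""
    work = Path(workdir)
    fltr = work / "fltr.txt"
    ani = work / "ani.tsv"
    ids = work / "ani.ids.tsv"  # assumption: produced alongside ani.tsv by `align`
    clusters = work / "clusters.tsv"
    return [
        [vclust, "prefilter", "-i", str(fasta), "-o", str(fltr)],
        [vclust, "align", "-i", str(fasta), "-o", str(ani), "--filter", str(fltr)],
        [vclust, "cluster", "-i", str(ani), "-o", str(clusters),
         "--ids", str(ids), "--metric", "ani", "--ani", str(ani_threshold)],
    ]


def run_quick_start(fasta, **kwargs):
    """Run the three steps in order, stopping at the first failure."""
    for cmd in quick_start_commands(fasta, **kwargs):
        subprocess.run(cmd, check=True)


commands = quick_start_commands("example/multifasta.fna")
```

Calling `run_quick_start("example/multifasta.fna")` from the repository root would then execute the pipeline; building the commands separately makes it easy to print or log them first.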

Documentation

The Vclust documentation is available on the GitHub Wiki and includes the following sections:

  1. Features
  2. Installation
  3. Quick Start
  4. Usage
    1. Input data
    2. Prefilter
    3. Align
    4. Cluster
  5. Optimizing sensitivity and resource usage
  6. Use cases
    1. Classify viruses into species and genera following ICTV standards
    2. Assign viral contigs into vOTUs following MIUViG standards
    3. Dereplicate viral contigs into representative genomes
    4. Calculate pairwise similarities between all-versus-all genomes
    5. Process large dataset of diverse virus genomes (IMG/VR)
    6. Process large dataset of highly redundant virus genomes
    7. Cluster plasmid genomes into pOTUs
  7. FAQ: Frequently Asked Questions

Citation

Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].

License

GNU General Public License, version 3
