Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomic contigs

Vclust

Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenome-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including the International Committee on Taxonomy of Viruses (ICTV) and Minimum Information about an Uncultivated Virus Genome (MIUViG).

Features

:gem: Accurate ANI calculations

Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, making LZ-ANI orders of magnitude faster than other alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.

:triangular_ruler: Multiple similarity measures

Vclust offers multiple similarity measures between two genome sequences:

  • ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
  • Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
  • Total ANI (tANI): The number of identical nucleotides in the query-reference and reference-query alignments divided by the summed length of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
  • Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
  • Number of local alignments: The number of local alignments between the two genome sequences.
  • Ratio between genome lengths: The length of the shorter genome divided by the longer one.
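The definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Vclust's or LZ-ANI's code: the `Direction` class, its field names, and the example numbers are hypothetical, and the sketch assumes that the total length of the local alignments also counts the aligned query bases when computing coverage.

```python
from dataclasses import dataclass


@dataclass
class Direction:
    """Totals over all local alignments for one query -> reference direction.

    Hypothetical container for illustration; not a Vclust data structure.
    """
    identical: int   # identical nucleotides summed over local alignments
    aligned: int     # total length of the local alignments
    query_len: int   # length of the query genome in this direction


def ani(d: Direction) -> float:
    """Identical nucleotides divided by the total alignment length."""
    return d.identical / d.aligned if d.aligned else 0.0


def gani(d: Direction) -> float:
    """Identical nucleotides divided by the query genome length."""
    return d.identical / d.query_len


def coverage(d: Direction) -> float:
    """Aligned fraction of the query genome (simplifying assumption above)."""
    return d.aligned / d.query_len


def tani(qr: Direction, rq: Direction) -> float:
    """Identical nucleotides in both directions over the summed genome lengths."""
    return (qr.identical + rq.identical) / (qr.query_len + rq.query_len)


def len_ratio(len_a: int, len_b: int) -> float:
    """Length of the shorter genome divided by the longer one."""
    return min(len_a, len_b) / max(len_a, len_b)


# Made-up totals for a 10 kb query and a 12 kb reference.
qr = Direction(identical=9000, aligned=9500, query_len=10000)
rq = Direction(identical=8800, aligned=9400, query_len=12000)
print(ani(qr), gani(qr), coverage(qr), tani(qr, rq), len_ratio(10000, 12000))
```

ANI, gANI, and coverage are directional (query versus reference), whereas tANI and the length ratio are symmetric in the two genomes.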

:star2: Multiple clustering algorithms

Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.

  • Single-linkage
  • Complete-linkage
  • UCLUST
  • CD-HIT (Greedy incremental)
  • Greedy set cover (adopted from MMseqs2)
  • Leiden algorithm [optional]
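To give a feel for the simplest of these algorithms, here is a minimal single-linkage sketch: genomes are merged into one cluster whenever a chain of pairwise similarities at or above the threshold connects them. It is an illustrative union-find implementation written for this README, not Clusty's code; the function name, the `(id_a, id_b, similarity)` input format, and the example values are assumptions.

```python
def single_linkage(ids, pairs, threshold):
    """Cluster ids by linking every pair whose similarity >= threshold.

    ids:   iterable of genome identifiers
    pairs: iterable of (id_a, id_b, similarity) tuples (hypothetical format)
    Returns a sorted list of sorted clusters.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up similarities: A-B and B-C pass the threshold, A-C does not.
example_pairs = [("A", "B", 0.97), ("B", "C", 0.96), ("A", "C", 0.90), ("C", "D", 0.80)]
print(single_linkage(["A", "B", "C", "D"], example_pairs, threshold=0.95))
```

In this example A, B, and C end up in one cluster because B links A to C, even though A and C fall below the threshold. Complete-linkage would not place all three together, since it requires every pair within a cluster to meet the threshold.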

:fire: Speed and efficiency

Vclust uses three efficient C++ tools - Kmer-db, LZ-ANI, Clusty - for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.

:earth_americas: Web service

For datasets of up to 1000 viral genomes, Vclust is also available as a web service at http://www.vclust.org.

Quick start

# Install Vclust (requires Python >= 3.7)
pip install vclust

# Prefilter similar genome sequence pairs before conducting pairwise alignments.
vclust prefilter -i example/multifasta.fna -o fltr.txt

# Align similar genome sequence pairs and calculate pairwise ANI measures.
vclust align -i example/multifasta.fna -o ani.tsv --filter fltr.txt

# Cluster genome sequences based on given ANI measure and minimum threshold.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
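The three steps above can also be driven from a script. The sketch below only assembles the same commands and flags shown in the Quick start and runs them in order. The helper names `quick_start_commands` and `run_quick_start` are hypothetical, and the sketch assumes, as the commands above imply, that the align step writes `ani.ids.tsv` next to `ani.tsv`.

```python
import subprocess
from pathlib import Path


def quick_start_commands(fasta, out_dir=".", metric="ani", min_ani=0.95):
    """Build the prefilter, align, and cluster commands from the Quick start.

    Hypothetical helper; it uses only the flags shown in the Quick start.
    """
    out = Path(out_dir)
    fltr = str(out / "fltr.txt")
    ani = str(out / "ani.tsv")
    ids = str(out / "ani.ids.tsv")  # assumed to be written by the align step
    clusters = str(out / "clusters.tsv")
    return [
        ["vclust", "prefilter", "-i", fasta, "-o", fltr],
        ["vclust", "align", "-i", fasta, "-o", ani, "--filter", fltr],
        ["vclust", "cluster", "-i", ani, "-o", clusters, "--ids", ids,
         "--metric", metric, "--ani", str(min_ani)],
    ]


def run_quick_start(fasta, out_dir="."):
    """Run the three steps in order, stopping at the first failure."""
    for cmd in quick_start_commands(fasta, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_quick_start("example/multifasta.fna")` requires `vclust` to be installed and on the `PATH`. Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit status, a failed prefilter or align step prevents the later steps from running on incomplete output.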

Documentation

The Vclust documentation is available on the GitHub Wiki and includes the following sections:

  1. Features
  2. Installation
  3. Quick Start
  4. Usage
    1. Input data
    2. Prefilter
    3. Align
    4. Cluster
  5. Optimizing sensitivity and resource usage
  6. Use cases
    1. Classify viruses into species and genera following ICTV standards
    2. Assign viral contigs into vOTUs following MIUViG standards
    3. Dereplicate viral contigs into representative genomes
    4. Calculate pairwise similarities between all-versus-all genomes
    5. Process large dataset of diverse virus genomes (IMG/VR)
    6. Process large dataset of highly redundant virus genomes
    7. Cluster plasmid genomes into pOTUs
  7. FAQ: Frequently Asked Questions

Citation

Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].

License

GNU General Public License, version 3
