Skip to main content

ViOTUcluster: A high-speed, all-in-one solution that streamlines the entire virome analysis workflow

Project description

ViOTUcluster: A High-Speed, All-in-One Pipeline for Viromic Analysis from Metagenomic Data

PyPI - Version

ViOTUcluster is a high-speed, All-in-One solution that streamlines the entire viromics analysis workflow—from raw reads to the generation of viral operational taxonomic units (vOTUs) tables, which include abundance, taxonomy, and quality information, as well as assembled viral genomes, AMG prediction, and host prediction. ViOTUcluster supports the simultaneous processing of multiple samples, efficiently clustering viral sequences across datasets to generate vOTU-related files.

alt text

Sihang Liu
Dec 2024   
liusihang@tongji.edu.cn
College of Environmental Science and Engineering
Tongji University 

Full Text & Citation

See more details in the manuscript on iMetaOmics:

Liu, S., Ye, Y., Guo, B., Hu, Y., Jiang, K., Liang, C., Xia, S. and Wang, H. (2025), ViOTUcluster: A high-speed, All-in-one pipeline for viromic analysis of metagenomic data. iMetaOmics e70023. https://doi.org/10.1002/imo2.70023

Instruction

Demo for using ViOTUcluster

asciicast

Recorded with asciinema

  1. Prerequisites
  2. Installation
  3. How to Use
  4. File Structure Example
  5. Final Output
  6. Contact

Important updates

  • Version 0.5.5: Added three concurrency controls options,--max-prediction-tasks (-P), --tpm-tasks (-T), --assemble-jobs (-A), which could help to limit the over memory usage.

Prerequisites

Before installing ViOTUcluster, ensure the following tools are available on your system:

Installation

ViOTUcluster has been tested on Ubuntu and CentOS and should be compatible with all Linux distributions.

First-Time Installation of ViOTUcluster

Follow these steps to install ViOTUcluster for the first time:

ViOTUcluster comes with an all-in-one setup script that pulls three pre-packed Conda environments (ViOTUcluster / vRhyme / DRAM + iPhop) and unpacks them in one shot.

Option What it does
--china Switch download source from Zenodo to China SciDB mirrors (faster in mainland CN).
-p PATH Install the whole stack outside your base Conda directory (default is <conda-root>/envs/ViOTUcluster).
-h, --help Show full option list.

  1. Download and Setup ViOTUcluster

    ViOTUcluster simplifies the installation of itself and its core dependencies (like vRhyme, DRAM, and iPhop) by providing a setup script that downloads pre-packaged Conda environments.

    The setup script can be run directly using wget and bash.

    Default Installation (Recommended for most users, downloads from Zenodo): This command will download the setup script and execute it, which will then download the environment packages from Zenodo.

    wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash
    

    Alternative for Users in Mainland China (Downloads from China SciDB): If you are in mainland China or experience slow downloads from Zenodo, you can instruct the script to use download mirrors hosted on China SciDB.

    wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- --china
    

    For users who lack write access to the Conda base directory or who prefer to install to a custom location:

    wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- -p /PATH/YOU/WANT
    

    You can combine flags, for example:

    wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- --china -p /PATH/YOU/WANT
    

    Note: When you install to a custom prefix, activate the environment with the full path, e.g.

    conda activate /YOUR/CUSTOM/PATH/ViOTUcluster
    
  2. Verify Installation of All Dependencies

    To confirm that all required dependencies are correctly installed, run:

    conda activate ViOTUcluster
    pip install --upgrade ViOTUcluster #Important,to keep all script up-to-date.
    ViOTUcluster_Check
    

    A successful check will produce output similar to this:

    Checking dependencies...
    [] fastp is installed.
    [] megahit is installed.
    [] spades.py is installed.
    [] virsorter is installed.
    [] viralverify is installed.
    [] genomad is installed.
    [] checkv is installed.
    [] dRep is installed.
    [] checkm is installed.
    [] bwa is installed.
    All dependencies are installed.
    
  3. Set Up Databases

    ViOTUcluster_download-database "/path/to/db" "num"
    

    If the specified directory (/path/to/db) does not already contain the required databases, the script will download and install them automatically. Replace /path/to/db with your preferred database directory and num with the number of threads to use during installation.

    Note: The setup process involves downloading approximately 30 GB of database files, so the installation time depends heavily on your network speed. A stable, high-speed internet connection is recommended to prevent installation failures.

  4. Set Up DRAM and iPhop Environments(Optional for advanced analysis)

    Install DRAM Database

    To install the DRAM database, first activate the ViOTUcluster environment and then run the setup command:

    conda activate ViOTUcluster
    DRAM-setup.py download "/path/to/db/DRAM"
    

    If you have an existing DRAM environment and want to migrate its settings, follow these steps:

    1. Export Configuration from the Old Environment:

      conda activate old_DRAM_env
      DRAM-setup.py export_config > my_old_config.txt
      
    2. Import the Configuration into the New Environment:

      conda activate ViOTUcluster
      DRAM-setup.py import_config my_old_config.txt
      

    Install iPhop Database

    To install the iPhop database, activate the ViOTUcluster environment and run the database download command:

    conda activate ViOTUcluster
    iPhop-setup.py "/path/to/db"
    

    Important Notes

    • Database Storage: Ensure that the databases for both DRAM and iPhop are stored in the directory specified during the ViOTUcluster_download-database step.
    • Expected Database Structure: For details on the expected database structure, refer to the File Structure Example section.
    • Official Documentation: For additional instructions on downloading and configuring these databases, refer to the official documentation for:
  5. Test the Complete ViOTUcluster Workflow with Mini-Samples

    To verify ViOTUcluster full workflow are functioning correctly, you can run a test using the ViOTUcluster_Test command with a set of mini FASTQ samples.

    conda activate ViOTUcluster
    ViOTUcluster_Test -d /path/to/db
    

    This command will automatically utilize all available threads to execute the entire ViOTUcluster workflow on the provided mini FASTQ samples. Be sure to replace /path/to/db with the path to your database directory.

Updating ViOTUcluster from an Older Version

To update an existing ViOTUcluster installation to the latest version, use pip:

pip install --upgrade ViOTUcluster

This command will upgrade the ViOTUcluster scripts while preserving your existing environment.

Additional Notes

If you run into any difficulties while setting up these environments, feel free to report them by opening an issue on the respective GitHub or Bitbucket repositories for DRAM or iPhop.


How to Use

To run the pipeline, use the following command structure:

  1. Create and activate the vRhyme environment

    ViOTUcluster -i <input_path_to_contigs> -r <input_path_raw_seqs> -o <output_path> -d <database_path> -n <threads> -m <min-sequence length> --non-con/--con [--reassemble] [--disable-binning] [--max-prediction-tasks <N>] [--tpm-tasks <N>] [--assemble-jobs <N>]
    
  2. Start with raw fastq files

    ViOTUcluster_AllinOne -r <input_path_raw_seqs> -o <output_path> -d <database_path> -a <assembly_software> -n <threads> -m <min-sequence length> --non-con/--con [--reassemble] [--disable-binning] [--max-prediction-tasks <N>] [--tpm-tasks <N>] [--assemble-jobs <N>]
    

A mini test file is available for download at MiniTest.zip. You can use this file in All-in-One mode to verify that the pipeline is successfully installed and functioning.

Parameters

  • -i <input_path_to_contigs>: Specifies the directory containing the assembled contig files in FASTA format (e.g., example1.fasta). Each contig file should have corresponding raw sequencing FASTQ files in the raw sequence directory, sharing the same prefix.

  • *-r <input_path_raw_seqs> *: Spe cifies the directory with raw sequencing data in FASTQ format. The FASTQ files must have the same prefix as the corresponding contigs file. For example, if the contigs file is example1.fasta, the FASTQ files should be named example1_R1.fq and example1_R2.fq. The paired-end metagenomic reads should end with .fq, .fq.gz, .fastq, or .fastq.gz.

  • -o <output_path>: Defines the output directory for storing the processed results. This will include filtered sequences, prediction outcomes, binning results, and the final dereplicated viral contigs.

  • -d <database_path>: Points to the required database for performing viral prediction, binning, and dereplication steps.

  • -m, --min-length <length>: Specify the minimum length (bp) for sequences (default: 2500). The same value is applied during initial contig filtering and again before dRep clustering to keep downstream analyses in sync with the user input.

  • --non-con/--con: Specifies the viral prediction criteria based on the sample preparation method. Use --non-con for samples that were not enriched using viral-particle concentration methods, typically containing a low viral proportion. Use --con for samples subjected to concentration methods, which are expected to have a medium to high viral proportion.

  • --reassemble: (Optional) Enables reassembly of bins after the initial binning process to enhance the accuracy and quality of the final contigs. This feature is still in beta and can significantly increase runtime.

  • --disable-binning: Skip the vRhyme binning stage entirely. When enabled, the pipeline copies the per-sample filtered contigs directly into the dereplication and summary steps, which is useful when bins cannot be recovered for some samples.

  • -a <assembly_software>: (For ViOTUcluster_AllinOne only) Specifies the assembly software used during the raw sequence processing. Accepted values are -a megahit or -a metaspades.

  • --max-prediction-tasks, -P <N>: Cap total concurrent prediction jobs (e.g., viralverify/virsorter2/genomad), default 30.

  • --tpm-tasks, -T <N>: Cap concurrent BAM/TPM processing samples, default 15.

  • --assemble-jobs, -A <N>: Cap concurrent assembly samples, default 10.

File Structure Example

Below is a tree list of how the file structure should be organized, assuming the prefix for the example files is example1:

<project_directory>/
│
├── input_contigs/
│   ├── example1.fasta
│   ├── example2.fasta
│   └── ...
│
├── input_fastq/
│   ├── example1_R1.fq
│   ├── example1_R2.fq
│   ├── example2_R1.fq
│   ├── example2_R2.fq
│   └── ...
│
├── output_path/
│   ├── Summary/
│   │   ├── SeperateRes
│   │   │   ├── example1_viralseqs.fasta
│   │   │   ├── example2_viralseqs.fasta
│   │   │   └── ... 
│   │   ├── vOTU
│   │   │    ├── vOTU.fasta
│   │   │    ├── vOTU.Abundance.csv
│   │   │    ├── vOTU.Taxonomy.csv
│   │   │    └── CheckVRes
│   │   ├── DRAMRes(Optional)
│   │   │    ├── DRAM_annotations.tsv
│   │   │    └── DRAM_Gene.Abundance.csv
│   │   └── iPhopRes(Optional)
│   └── (IntermediateFile....)
│
└── databases/
    ├── db/                # VirSorter2 database
    ├── viralVerify/       # ViralVerify database
    ├── checkv-db-v1.5/    # CheckV database (version 1.5)
    ├── genomad_db/        # Genomad database
    └── Aug_2023_pub_rw/   # iPhop database
  • input_contigs/ contains the assembled contigs (e.g., example1.fasta).
  • input_fastq/ contains the corresponding FASTQ files (e.g., example1_R1.fq and example1_R2.fq).
  • output_results/ is the directory where all output files will be stored.
  • databases/ contains the required databases for the analysis, including:
    • db/: The VirSorter2 database.
    • ViralVerify/: The ViralVerify database.
    • checkv-db-v1.5/: The CheckV database (version 1.5).
    • genomad_db/: The Genomad database.

Final Output

The processed data is organized under the specified output_path/, with the following structure:

  • output_path/Summary: Contains the final results and summaries for all processed samples, organized into the following subdirectories:

    • SeperateRes: Holds individual directories for each sample (e.g., example1, example2):
      • <sample>_viralseqs.fasta: The list of predicted viral contigs for the respective sample.
    • vOTU/: Contains the final processed viral OTU (vOTU) results across all samples:
      • vOTU.fasta: The final dereplicated viral contigs after clustering from all samples.
      • vOTU.Abundance.csv: Abundance data of the vOTUs across samples.
      • vOTU.Taxonomy.csv: Taxonomic assignments for the vOTUs, if available.
      • CheckVRes: Summarized CheckV quality assessments for final vOTUs file.
    • DRAMRes (Optional): Optional functional annotations from DRAM if the advanced analysis stage is executed.
      • DRAM_annotations.tsv: Aggregated DRAM annotations for all predicted genes.
      • DRAM_Gene.Abundance.csv: TPM-based abundance estimates for each DRAM-predicted gene across samples.
    • iPhopRes (Optional): Optional results from iPhop annotation if included in the workflow.
  • output_path/IntermediateFile: This directory holds intermediate files generated during the processing pipeline, such as filtered sequences and any temporary data.

  • databases/: Stores the necessary databases used for various stages of the analysis:

    • db/: The VirSorter2 database.
    • ViralVerify/: The ViralVerify database, used for viral prediction.
    • checkv-db-v1.5/: The CheckV database (version 1.5) for quality control of viral sequences.
    • genomad_db/: The Genomad database for viral identification and dereplication.

Acknowledgement

ViOTUcluster integrates state-of-the-art viromics analysis tools. The main tools within ViOTUcluster are listed below.

fastp: Online Publication

Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107.

MEGAHIT: Online Publication

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics

SPAdes: Online Publication

Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70, e102. 

geNomad: Online Publication

Camargo, Antonio Pedro, Simon Roux, Frederik Schulz, Michal Babinski, Yan Xu, Bin Hu, Patrick SG Chain, Stephen Nayfach, and Nikos C. Kyrpides. "Identification of mobile genetic elements with geNomad." Nature Biotechnology (2023): 1-10. 

viralVerify: Online Publication

Dmitry Antipov, Mikhail Raiko, Alla Lapidus, Pavel A Pevzner, MetaviralSPAdes: assembly of viruses from metagenomic data, Bioinformatics, Volume 36, Issue 14, July 2020, Pages 4126–4129

VirSorter2: Online Publication

Guo, Jiarong, Ben Bolduc, Ahmed A. Zayed, Arvind Varsani, Guillermo Dominguez-Huerta, Tom O. Delmont, Akbar Adjie Pratama et al. "VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses." Microbiome 9 (2021): 1-13.

PyHMMER: Online Publication

Martin Larralde, Georg Zeller, PyHMMER: a Python library binding to HMMER for efficient sequence analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad214

CheckV: Online Publication

Nayfach, S., Camargo, A.P., Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021)

vRhyme: Online Publication

Kieft, Kristopher, Alyssa Adams, Rauf Salamzade, Lindsay Kalan, and Karthik Anantharaman. "vRhyme enables binning of viral genomes from metagenomes." Nucleic Acids Research 50, no. 14 (2022): e83-e83.

dRep: Online Publication

Olm, M., Brown, C., Brooks, B. et al. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11, 2864–2868 (2017)

CheckM: Online Publication

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55

BWA: Online Publication

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.

Sambamba: Online Publication

Artem Tarasov, Albert J. Vilella, Edwin Cuppen, Isaac J. Nijman, Pjotr Prins, Sambamba: fast processing of NGS alignment formats, Bioinformatics, Volume 31, Issue 12, June 2015, Pages 2032–2034

DRAM: Online Publication

Michael Shaffer, Mikayla A Borton, Bridget B McGivern, Ahmed A Zayed, Sabina Leanti La Rosa, Lindsey M Solden, Pengfei Liu, Adrienne B Narrowe, Josué Rodríguez-Ramos, Benjamin Bolduc, M Consuelo Gazitúa, Rebecca A Daly, Garrett J Smith, Dean R Vik, Phil B Pope, Matthew B Sullivan, Simon Roux, Kelly C Wrighton, DRAM for distilling microbial metabolism to automate the curation of microbiome function, Nucleic Acids Research, Volume 48, Issue 16, 18 September 2020, Pages 8883–8900

iPHoP: Online Publication

Roux, Simon, Antonio Pedro Camargo, Felipe Hernandes Coutinho, Shareef M. Dabdoub, Bas E. Dutilh, Stephen Nayfach, and Andrew Tritt. "iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes." bioRxiv (2022): 2022-07.

Contact

Feel free to contact Sihang Liu (liusihang@tongji.edu.cn or GitHub Issues) with any questions or comments!

####################################################################################################
██╗   ██╗██╗ ██████╗ ████████╗██╗   ██╗ ██████╗██╗     ██╗   ██╗███████╗████████╗███████╗██████╗ 
██║   ██║██║██╔═══██╗╚══██╔══╝██║   ██║██╔════╝██║     ██║   ██║██╔════╝╚══██╔══╝██╔════╝██╔══██╗
██║   ██║██║██║   ██║   ██║   ██║   ██║██║     ██║     ██║   ██║███████╗   ██║   █████╗  ██████╔╝
╚██╗ ██╔╝██║██║   ██║   ██║   ██║   ██║██║     ██║     ██║   ██║╚════██║   ██║   ██╔══╝  ██╔══██╗
 ╚████╔╝ ██║╚██████╔╝   ██║   ╚██████╔╝╚██████╗███████╗╚██████╔╝███████║   ██║   ███████╗██║  ██║
  ╚═══╝  ╚═╝ ╚═════╝    ╚═╝    ╚═════╝  ╚═════╝╚══════╝ ╚═════╝ ╚══════╝   ╚═╝   ╚══════╝╚═╝  ╚═╝
####################################################################################################

Copyright

ViOTUcluster Copyright (C) 2025

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License, version 2, as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

viotucluster-0.5.7.1.tar.gz (75.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

viotucluster-0.5.7.1-py3-none-any.whl (88.0 kB view details)

Uploaded Python 3

File details

Details for the file viotucluster-0.5.7.1.tar.gz.

File metadata

  • Download URL: viotucluster-0.5.7.1.tar.gz
  • Upload date:
  • Size: 75.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for viotucluster-0.5.7.1.tar.gz
Algorithm Hash digest
SHA256 585f2a4b205c66150de5aa76b09b90915084737e2c0605801438d0cb47b1a17f
MD5 6ff9224d52b858d9179ba27b7dff0f6c
BLAKE2b-256 a4556c71bfc2aaac92f78bfaa8016f20d98d55d4d8d1ed36f562e884cbd668d6

See more details on using hashes here.

File details

Details for the file viotucluster-0.5.7.1-py3-none-any.whl.

File metadata

  • Download URL: viotucluster-0.5.7.1-py3-none-any.whl
  • Upload date:
  • Size: 88.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for viotucluster-0.5.7.1-py3-none-any.whl
Algorithm Hash digest
SHA256 eb9b51f495e467bd3c79631df8ed680071ae27d60cc93ff5108f9dcb0350c6c0
MD5 a3b171ca49540e8bd1f1da200120fb44
BLAKE2b-256 21ee038907828a8b34aa4dccb0b9e659dc9d1c5401bcc5a70d51fd6bce0d4f93

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page