ViOTUcluster: A high-speed, all-in-one solution that streamlines the entire virome analysis workflow
Project description
ViOTUcluster: A High-Speed, All-in-One Pipeline for Viromic Analysis from Metagenomic Data
ViOTUcluster is a high-speed, All-in-One solution that streamlines the entire viromics analysis workflow—from raw reads to the generation of viral operational taxonomic units (vOTUs) tables, which include abundance, taxonomy, and quality information, as well as assembled viral genomes, AMG prediction, and host prediction. ViOTUcluster supports the simultaneous processing of multiple samples, efficiently clustering viral sequences across datasets to generate vOTU-related files.
Sihang Liu
Dec 2024
liusihang@tongji.edu.cn
College of Environmental Science and Engineering
Tongji University
Full Text & Citation
See more details in the manuscript on iMetaOmics:
Liu, S., Ye, Y., Guo, B., Hu, Y., Jiang, K., Liang, C., Xia, S. and Wang, H. (2025), ViOTUcluster: A high-speed, All-in-one pipeline for viromic analysis of metagenomic data. iMetaOmics e70023. https://doi.org/10.1002/imo2.70023
Instruction
Demo for using ViOTUcluster
Recorded with asciinema
Important updates
- Version 0.5.5: Added three concurrency controls options,
--max-prediction-tasks (-P),--tpm-tasks (-T),--assemble-jobs (-A), which could help to limit the over memory usage.
Prerequisites
Before installing ViOTUcluster, ensure the following tools are available on your system:
Installation
ViOTUcluster has been tested on Ubuntu and CentOS and should be compatible with all Linux distributions.
First-Time Installation of ViOTUcluster
Follow these steps to install ViOTUcluster for the first time:
ViOTUcluster comes with an all-in-one setup script that pulls three pre-packed Conda environments (ViOTUcluster / vRhyme / DRAM + iPhop) and unpacks them in one shot.
| Option | What it does |
|---|---|
--china |
Switch download source from Zenodo to China SciDB mirrors (faster in mainland CN). |
-p PATH |
Install the whole stack outside your base Conda directory (default is <conda-root>/envs/ViOTUcluster). |
-h, --help |
Show full option list. |
-
Download and Setup ViOTUcluster
ViOTUcluster simplifies the installation of itself and its core dependencies (like vRhyme, DRAM, and iPhop) by providing a setup script that downloads pre-packaged Conda environments.
The setup script can be run directly using
wgetandbash.Default Installation (Recommended for most users, downloads from Zenodo): This command will download the setup script and execute it, which will then download the environment packages from Zenodo.
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash
Alternative for Users in Mainland China (Downloads from China SciDB): If you are in mainland China or experience slow downloads from Zenodo, you can instruct the script to use download mirrors hosted on China SciDB.
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- --china
For users who lack write access to the Conda base directory or who prefer to install to a custom location:
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- -p /PATH/YOU/WANT
You can combine flags, for example:
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- --china -p /PATH/YOU/WANT
Note: When you install to a custom prefix, activate the environment with the full path, e.g.
conda activate /YOUR/CUSTOM/PATH/ViOTUcluster
-
Verify Installation of All Dependencies
To confirm that all required dependencies are correctly installed, run:
conda activate ViOTUcluster pip install --upgrade ViOTUcluster #Important,to keep all script up-to-date. ViOTUcluster_Check
A successful check will produce output similar to this:
Checking dependencies... [✅] fastp is installed. [✅] megahit is installed. [✅] spades.py is installed. [✅] virsorter is installed. [✅] viralverify is installed. [✅] genomad is installed. [✅] checkv is installed. [✅] dRep is installed. [✅] checkm is installed. [✅] bwa is installed. All dependencies are installed.
-
Set Up Databases
ViOTUcluster_download-database "/path/to/db" "num"
If the specified directory (
/path/to/db) does not already contain the required databases, the script will download and install them automatically. Replace/path/to/dbwith your preferred database directory andnumwith the number of threads to use during installation.Note: The setup process involves downloading approximately 30 GB of database files, so the installation time depends heavily on your network speed. A stable, high-speed internet connection is recommended to prevent installation failures.
-
Set Up DRAM and iPhop Environments(Optional for advanced analysis)
Install DRAM Database
To install the DRAM database, first activate the
ViOTUclusterenvironment and then run the setup command:conda activate ViOTUcluster DRAM-setup.py download "/path/to/db/DRAM"
If you have an existing DRAM environment and want to migrate its settings, follow these steps:
-
Export Configuration from the Old Environment:
conda activate old_DRAM_env DRAM-setup.py export_config > my_old_config.txt
-
Import the Configuration into the New Environment:
conda activate ViOTUcluster DRAM-setup.py import_config my_old_config.txt
Install iPhop Database
To install the iPhop database, activate the
ViOTUclusterenvironment and run the database download command:conda activate ViOTUcluster iPhop-setup.py "/path/to/db"
Important Notes
- Database Storage: Ensure that the databases for both DRAM and iPhop are stored in the directory specified during the
ViOTUcluster_download-databasestep. - Expected Database Structure: For details on the expected database structure, refer to the File Structure Example section.
- Official Documentation: For additional instructions on downloading and configuring these databases, refer to the official documentation for:
-
-
Test the Complete ViOTUcluster Workflow with Mini-Samples
To verify ViOTUcluster full workflow are functioning correctly, you can run a test using the
ViOTUcluster_Testcommand with a set of mini FASTQ samples.conda activate ViOTUcluster ViOTUcluster_Test -d /path/to/db
This command will automatically utilize all available threads to execute the entire ViOTUcluster workflow on the provided mini FASTQ samples. Be sure to replace /path/to/db with the path to your database directory.
Updating ViOTUcluster from an Older Version
To update an existing ViOTUcluster installation to the latest version, use pip:
pip install --upgrade ViOTUcluster
This command will upgrade the ViOTUcluster scripts while preserving your existing environment.
Additional Notes
If you run into any difficulties while setting up these environments, feel free to report them by opening an issue on the respective GitHub or Bitbucket repositories for DRAM or iPhop.
How to Use
To run the pipeline, use the following command structure:
-
Create and activate the vRhyme environment
ViOTUcluster -i <input_path_to_contigs> -r <input_path_raw_seqs> -o <output_path> -d <database_path> -n <threads> -m <min-sequence length> --non-con/--con [--reassemble] [--disable-binning] [--max-prediction-tasks <N>] [--tpm-tasks <N>] [--assemble-jobs <N>]
-
Start with raw fastq files
ViOTUcluster_AllinOne -r <input_path_raw_seqs> -o <output_path> -d <database_path> -a <assembly_software> -n <threads> -m <min-sequence length> --non-con/--con [--reassemble] [--disable-binning] [--max-prediction-tasks <N>] [--tpm-tasks <N>] [--assemble-jobs <N>]
A mini test file is available for download at MiniTest.zip. You can use this file in All-in-One mode to verify that the pipeline is successfully installed and functioning.
Parameters
-
-i <input_path_to_contigs>: Specifies the directory containing the assembled contig files in FASTA format (e.g.,example1.fasta). Each contig file should have corresponding raw sequencing FASTQ files in the raw sequence directory, sharing the same prefix. -
*
-r <input_path_raw_seqs>*: Spe cifies the directory with raw sequencing data in FASTQ format. The FASTQ files must have the same prefix as the corresponding contigs file. For example, if the contigs file isexample1.fasta, the FASTQ files should be namedexample1_R1.fqandexample1_R2.fq. The paired-end metagenomic reads should end with.fq,.fq.gz,.fastq, or.fastq.gz. -
-o <output_path>: Defines the output directory for storing the processed results. This will include filtered sequences, prediction outcomes, binning results, and the final dereplicated viral contigs. -
-d <database_path>: Points to the required database for performing viral prediction, binning, and dereplication steps. -
-m, --min-length <length>: Specify the minimum length (bp) for sequences (default: 2500). The same value is applied during initial contig filtering and again before dRep clustering to keep downstream analyses in sync with the user input. -
--non-con/--con: Specifies the viral prediction criteria based on the sample preparation method. Use--non-confor samples that were not enriched using viral-particle concentration methods, typically containing a low viral proportion. Use--confor samples subjected to concentration methods, which are expected to have a medium to high viral proportion. -
--reassemble: (Optional) Enables reassembly of bins after the initial binning process to enhance the accuracy and quality of the final contigs. This feature is still in beta and can significantly increase runtime. -
--disable-binning: Skip the vRhyme binning stage entirely. When enabled, the pipeline copies the per-sample filtered contigs directly into the dereplication and summary steps, which is useful when bins cannot be recovered for some samples. -
-a <assembly_software>: (ForViOTUcluster_AllinOneonly) Specifies the assembly software used during the raw sequence processing. Accepted values are-a megahitor-a metaspades. -
--max-prediction-tasks, -P <N>: Cap total concurrent prediction jobs (e.g., viralverify/virsorter2/genomad), default 30. -
--tpm-tasks, -T <N>: Cap concurrent BAM/TPM processing samples, default 15. -
--assemble-jobs, -A <N>: Cap concurrent assembly samples, default 10.
File Structure Example
Below is a tree list of how the file structure should be organized, assuming the prefix for the example files is example1:
<project_directory>/
│
├── input_contigs/
│ ├── example1.fasta
│ ├── example2.fasta
│ └── ...
│
├── input_fastq/
│ ├── example1_R1.fq
│ ├── example1_R2.fq
│ ├── example2_R1.fq
│ ├── example2_R2.fq
│ └── ...
│
├── output_path/
│ ├── Summary/
│ │ ├── SeperateRes
│ │ │ ├── example1_viralseqs.fasta
│ │ │ ├── example2_viralseqs.fasta
│ │ │ └── ...
│ │ ├── vOTU
│ │ │ ├── vOTU.fasta
│ │ │ ├── vOTU.Abundance.csv
│ │ │ ├── vOTU.Taxonomy.csv
│ │ │ └── CheckVRes
│ │ ├── DRAMRes(Optional)
│ │ │ ├── DRAM_annotations.tsv
│ │ │ └── DRAM_Gene.Abundance.csv
│ │ └── iPhopRes(Optional)
│ └── (IntermediateFile....)
│
└── databases/
├── db/ # VirSorter2 database
├── viralVerify/ # ViralVerify database
├── checkv-db-v1.5/ # CheckV database (version 1.5)
├── genomad_db/ # Genomad database
└── Aug_2023_pub_rw/ # iPhop database
input_contigs/contains the assembled contigs (e.g.,example1.fasta).input_fastq/contains the corresponding FASTQ files (e.g.,example1_R1.fqandexample1_R2.fq).output_results/is the directory where all output files will be stored.databases/contains the required databases for the analysis, including:db/: The VirSorter2 database.ViralVerify/: The ViralVerify database.checkv-db-v1.5/: The CheckV database (version 1.5).genomad_db/: The Genomad database.
Final Output
The processed data is organized under the specified output_path/, with the following structure:
-
output_path/Summary: Contains the final results and summaries for all processed samples, organized into the following subdirectories:SeperateRes: Holds individual directories for each sample (e.g.,example1,example2):<sample>_viralseqs.fasta: The list of predicted viral contigs for the respective sample.
vOTU/: Contains the final processed viral OTU (vOTU) results across all samples:vOTU.fasta: The final dereplicated viral contigs after clustering from all samples.vOTU.Abundance.csv: Abundance data of the vOTUs across samples.vOTU.Taxonomy.csv: Taxonomic assignments for the vOTUs, if available.CheckVRes: Summarized CheckV quality assessments for final vOTUs file.
DRAMRes (Optional): Optional functional annotations from DRAM if the advanced analysis stage is executed.DRAM_annotations.tsv: Aggregated DRAM annotations for all predicted genes.DRAM_Gene.Abundance.csv: TPM-based abundance estimates for each DRAM-predicted gene across samples.
iPhopRes (Optional): Optional results from iPhop annotation if included in the workflow.
-
output_path/IntermediateFile: This directory holds intermediate files generated during the processing pipeline, such as filtered sequences and any temporary data. -
databases/: Stores the necessary databases used for various stages of the analysis:db/: The VirSorter2 database.ViralVerify/: The ViralVerify database, used for viral prediction.checkv-db-v1.5/: The CheckV database (version 1.5) for quality control of viral sequences.genomad_db/: The Genomad database for viral identification and dereplication.
Acknowledgement
ViOTUcluster integrates state-of-the-art viromics analysis tools. The main tools within ViOTUcluster are listed below.
Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107.
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics
Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70, e102.
Camargo, Antonio Pedro, Simon Roux, Frederik Schulz, Michal Babinski, Yan Xu, Bin Hu, Patrick SG Chain, Stephen Nayfach, and Nikos C. Kyrpides. "Identification of mobile genetic elements with geNomad." Nature Biotechnology (2023): 1-10.
viralVerify: Online Publication
Dmitry Antipov, Mikhail Raiko, Alla Lapidus, Pavel A Pevzner, MetaviralSPAdes: assembly of viruses from metagenomic data, Bioinformatics, Volume 36, Issue 14, July 2020, Pages 4126–4129
VirSorter2: Online Publication
Guo, Jiarong, Ben Bolduc, Ahmed A. Zayed, Arvind Varsani, Guillermo Dominguez-Huerta, Tom O. Delmont, Akbar Adjie Pratama et al. "VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses." Microbiome 9 (2021): 1-13.
Martin Larralde, Georg Zeller, PyHMMER: a Python library binding to HMMER for efficient sequence analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad214
Nayfach, S., Camargo, A.P., Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021)
Kieft, Kristopher, Alyssa Adams, Rauf Salamzade, Lindsay Kalan, and Karthik Anantharaman. "vRhyme enables binning of viral genomes from metagenomes." Nucleic Acids Research 50, no. 14 (2022): e83-e83.
Olm, M., Brown, C., Brooks, B. et al. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11, 2864–2868 (2017)
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
Artem Tarasov, Albert J. Vilella, Edwin Cuppen, Isaac J. Nijman, Pjotr Prins, Sambamba: fast processing of NGS alignment formats, Bioinformatics, Volume 31, Issue 12, June 2015, Pages 2032–2034
Michael Shaffer, Mikayla A Borton, Bridget B McGivern, Ahmed A Zayed, Sabina Leanti La Rosa, Lindsey M Solden, Pengfei Liu, Adrienne B Narrowe, Josué Rodríguez-Ramos, Benjamin Bolduc, M Consuelo Gazitúa, Rebecca A Daly, Garrett J Smith, Dean R Vik, Phil B Pope, Matthew B Sullivan, Simon Roux, Kelly C Wrighton, DRAM for distilling microbial metabolism to automate the curation of microbiome function, Nucleic Acids Research, Volume 48, Issue 16, 18 September 2020, Pages 8883–8900
Roux, Simon, Antonio Pedro Camargo, Felipe Hernandes Coutinho, Shareef M. Dabdoub, Bas E. Dutilh, Stephen Nayfach, and Andrew Tritt. "iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes." bioRxiv (2022): 2022-07.
Contact
Feel free to contact Sihang Liu (liusihang@tongji.edu.cn or GitHub Issues) with any questions or comments!
####################################################################################################
██╗ ██╗██╗ ██████╗ ████████╗██╗ ██╗ ██████╗██╗ ██╗ ██╗███████╗████████╗███████╗██████╗
██║ ██║██║██╔═══██╗╚══██╔══╝██║ ██║██╔════╝██║ ██║ ██║██╔════╝╚══██╔══╝██╔════╝██╔══██╗
██║ ██║██║██║ ██║ ██║ ██║ ██║██║ ██║ ██║ ██║███████╗ ██║ █████╗ ██████╔╝
╚██╗ ██╔╝██║██║ ██║ ██║ ██║ ██║██║ ██║ ██║ ██║╚════██║ ██║ ██╔══╝ ██╔══██╗
╚████╔╝ ██║╚██████╔╝ ██║ ╚██████╔╝╚██████╗███████╗╚██████╔╝███████║ ██║ ███████╗██║ ██║
╚═══╝ ╚═╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═════╝╚══════╝ ╚═════╝ ╚══════╝ ╚═╝ ╚══════╝╚═╝ ╚═╝
####################################################################################################
Copyright
ViOTUcluster Copyright (C) 2025
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License, version 2, as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file viotucluster-0.5.7.1.tar.gz.
File metadata
- Download URL: viotucluster-0.5.7.1.tar.gz
- Upload date:
- Size: 75.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
585f2a4b205c66150de5aa76b09b90915084737e2c0605801438d0cb47b1a17f
|
|
| MD5 |
6ff9224d52b858d9179ba27b7dff0f6c
|
|
| BLAKE2b-256 |
a4556c71bfc2aaac92f78bfaa8016f20d98d55d4d8d1ed36f562e884cbd668d6
|
File details
Details for the file viotucluster-0.5.7.1-py3-none-any.whl.
File metadata
- Download URL: viotucluster-0.5.7.1-py3-none-any.whl
- Upload date:
- Size: 88.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eb9b51f495e467bd3c79631df8ed680071ae27d60cc93ff5108f9dcb0350c6c0
|
|
| MD5 |
a3b171ca49540e8bd1f1da200120fb44
|
|
| BLAKE2b-256 |
21ee038907828a8b34aa4dccb0b9e659dc9d1c5401bcc5a70d51fd6bce0d4f93
|