ViOTUcluster: A high-speed, all-in-one solution that streamlines the entire virome analysis workflow

ViOTUcluster: A High-Speed, All-in-One Pipeline for Viromic Analysis from Metagenomic Data

ViOTUcluster is a high-speed, all-in-one solution that streamlines the entire viromics analysis workflow, from raw reads to viral operational taxonomic unit (vOTU) tables that include abundance, taxonomy, and quality information, as well as assembled viral genomes, auxiliary metabolic gene (AMG) prediction, and host prediction. ViOTUcluster processes multiple samples simultaneously and clusters viral sequences across datasets to generate vOTU-related files.

Sihang Liu
Dec 2024   
liusihang@tongji.edu.cn
College of Environmental Science and Engineering
Tongji University

Full Text & Citation

See more details in the manuscript on iMetaOmics:

Liu, S., Ye, Y., Guo, B., Hu, Y., Jiang, K., Liang, C., Xia, S. and Wang, H. (2025), ViOTUcluster: A high-speed, All-in-one pipeline for viromic analysis of metagenomic data. iMetaOmics e70023. https://doi.org/10.1002/imo2.70023

Instruction

Prerequisites
Installation
How to Use
File Structure Example
Final Output
Contact

Important updates

Version 0.5.5: Added three concurrency control options, --max-prediction-tasks (-P), --tpm-tasks (-T), and --assemble-jobs (-A), which help limit excessive memory usage.

Prerequisites

Before installing ViOTUcluster, ensure the following tools are available on your system:

Miniconda or Anaconda
mamba (recommended for faster package management)
Git

Installation

ViOTUcluster has been tested on Ubuntu and CentOS and should be compatible with all Linux distributions.

First-Time Installation of ViOTUcluster

Follow these steps to install ViOTUcluster for the first time:

ViOTUcluster comes with an all-in-one setup script that pulls three pre-packed Conda environments (ViOTUcluster / vRhyme / DRAM + iPhop) and unpacks them in one shot.

| Option | What it does |
| --- | --- |
| `--china` | Switch the download source from Zenodo to China SciDB mirrors (faster in mainland China). |
| `-p PATH` | Install the whole stack outside your base Conda directory (default is `<conda-root>/envs/ViOTUcluster`). |
| `-h`, `--help` | Show the full option list. |

Download and Setup ViOTUcluster

ViOTUcluster simplifies the installation of itself and its core dependencies (like vRhyme, DRAM, and iPhop) by providing a setup script that downloads pre-packaged Conda environments.

The setup script can be run directly using wget and bash.

Default Installation (Recommended for most users, downloads from Zenodo): This command will download the setup script and execute it, which will then download the environment packages from Zenodo.
```
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash
```
Alternative for Users in Mainland China (Downloads from China SciDB): If you are in mainland China or experience slow downloads from Zenodo, you can instruct the script to use download mirrors hosted on China SciDB.
```
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- --china
```
For users who lack write access to the Conda base directory or who prefer to install to a custom location:
```
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- -p /PATH/YOU/WANT
```
You can combine flags, for example:
```
wget -qO- https://raw.githubusercontent.com/liusihang/ViOTUcluster/master/setup_ViOTUcluster.sh | bash -s -- --china -p /PATH/YOU/WANT
```
Note: When you install to a custom prefix, activate the environment with the full path, e.g.
```
conda activate /YOUR/CUSTOM/PATH/ViOTUcluster
```

Verify Installation of All Dependencies

To confirm that all required dependencies are correctly installed, run:

conda activate ViOTUcluster
pip install --upgrade ViOTUcluster  # Important: keeps all scripts up to date.
ViOTUcluster_Check

A successful check will produce output similar to this:

Checking dependencies...
[✅] fastp is installed.
[✅] megahit is installed.
[✅] spades.py is installed.
[✅] virsorter is installed.
[✅] viralverify is installed.
[✅] genomad is installed.
[✅] checkv is installed.
[✅] dRep is installed.
[✅] checkm is installed.
[✅] bwa is installed.
All dependencies are installed.
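
If the check reports a problem, it can help to test individual tools by hand. The sketch below is not the implementation of ViOTUcluster_Check; it only illustrates the idea by asking the shell whether each executable is on the PATH with the built-in `command -v`. The function name `check_tools` and its output wording are hypothetical, and the tool names in the usage note are taken from the sample output above.

```shell
# Hypothetical helper: report which of the given executables are on PATH.
# Prints one status line per tool and returns non-zero if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "[ok] $tool is installed."
        else
            echo "[missing] $tool was not found on PATH."
            missing=1
        fi
    done
    return "$missing"
}
```

Inside the activated environment, it could be called as `check_tools fastp megahit spades.py virsorter viralverify genomad checkv dRep checkm bwa`.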

Set Up Databases
```
ViOTUcluster_download-database "/path/to/db" "num"
```
If the specified directory (/path/to/db) does not already contain the required databases, the script will download and install them automatically. Replace /path/to/db with your preferred database directory and num with the number of threads to use during installation.

Note: The setup process involves downloading approximately 30 GB of database files, so the installation time depends heavily on your network speed. A stable, high-speed internet connection is recommended to prevent installation failures.
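
Because the download is large, it may be worth confirming that the target filesystem has enough room before starting. The following is a minimal sketch, assuming a POSIX `df` and `awk` are available; the function name `has_free_space` is hypothetical, and the 30 GiB default simply mirrors the approximate size quoted above rather than an exact requirement.

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1
# has at least $2 GiB available (default 30, an approximation).
has_free_space() {
    local dir=$1 need_gb=${2:-30} avail_kb
    # -P forces one-line POSIX output, -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space /path/to/db && ViOTUcluster_download-database "/path/to/db" "num"` would start the download only when the check passes.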

Set Up DRAM and iPhop Environments (Optional for advanced analysis)

Install DRAM Database

To install the DRAM database, first activate the ViOTUcluster environment and then run the setup command:
```
conda activate ViOTUcluster
DRAM-setup.py download "/path/to/db/DRAM"
```
If you have an existing DRAM environment and want to migrate its settings, follow these steps:
1. Export Configuration from the Old Environment:
```
conda activate old_DRAM_env
DRAM-setup.py export_config > my_old_config.txt
```
2. Import the Configuration into the New Environment:
```
conda activate ViOTUcluster
DRAM-setup.py import_config my_old_config.txt
```
Install iPhop Database

To install the iPhop database, activate the ViOTUcluster environment and run the database download command:
```
conda activate ViOTUcluster
iPhop-setup.py "/path/to/db"
```
Important Notes
- Database Storage: Ensure that the databases for both DRAM and iPhop are stored in the directory specified during the ViOTUcluster_download-database step.
- Expected Database Structure: For details on the expected database structure, refer to the File Structure Example section.
- Official Documentation: For additional instructions on downloading and configuring these databases, refer to the official documentation for:
  - DRAM
  - iPhop.

Test the Complete ViOTUcluster Workflow with Mini-Samples

To verify that the full ViOTUcluster workflow is functioning correctly, run the ViOTUcluster_Test command, which uses a set of mini FASTQ samples.
```
conda activate ViOTUcluster
ViOTUcluster_Test -d /path/to/db
```
This command will automatically utilize all available threads to execute the entire ViOTUcluster workflow on the provided mini FASTQ samples. Be sure to replace /path/to/db with the path to your database directory.

Updating ViOTUcluster from an Older Version

To update an existing ViOTUcluster installation to the latest version, use pip:

pip install --upgrade ViOTUcluster

This command will upgrade the ViOTUcluster scripts while preserving your existing environment.

Additional Notes

If you run into any difficulties while setting up these environments, feel free to report them by opening an issue on the respective GitHub or Bitbucket repositories for DRAM or iPhop.

How to Use

To run the pipeline, use the following command structure:

Start with assembled contigs

ViOTUcluster -i <input_path_to_contigs> -r <input_path_raw_seqs> -o <output_path> -d <database_path> -n <threads> -m <min-sequence length> --non-con/--con [--reassemble] [--disable-binning] [--max-prediction-tasks <N>] [--tpm-tasks <N>] [--assemble-jobs <N>]

Start with raw fastq files

ViOTUcluster_AllinOne -r <input_path_raw_seqs> -o <output_path> -d <database_path> -a <assembly_software> -n <threads> -m <min-sequence length> --non-con/--con [--reassemble] [--disable-binning] [--max-prediction-tasks <N>] [--tpm-tasks <N>] [--assemble-jobs <N>]

A mini test file is available for download at MiniTest.zip. You can use this file in All-in-One mode to verify that the pipeline is successfully installed and functioning.

Parameters

-i <input_path_to_contigs>: Specifies the directory containing the assembled contig files in FASTA format (e.g., example1.fasta). Each contig file should have corresponding raw sequencing FASTQ files in the raw sequence directory, sharing the same prefix.
-r <input_path_raw_seqs>: Specifies the directory with raw sequencing data in FASTQ format. The FASTQ files must have the same prefix as the corresponding contigs file. For example, if the contigs file is example1.fasta, the FASTQ files should be named example1_R1.fq and example1_R2.fq. The paired-end metagenomic reads should end with .fq, .fq.gz, .fastq, or .fastq.gz.
-o <output_path>: Defines the output directory for storing the processed results. This will include filtered sequences, prediction outcomes, binning results, and the final dereplicated viral contigs.
-d <database_path>: Points to the directory containing the databases required for the viral prediction, binning, and dereplication steps.
-n <threads>: Sets the number of threads to use.
-m, --min-length <length>: Specify the minimum length (bp) for sequences (default: 2500). The same value is applied during initial contig filtering and again before dRep clustering to keep downstream analyses in sync with the user input.
--non-con/--con: Specifies the viral prediction criteria based on the sample preparation method. Use --non-con for samples that were not enriched using viral-particle concentration methods, typically containing a low viral proportion. Use --con for samples subjected to concentration methods, which are expected to have a medium to high viral proportion.
--reassemble: (Optional) Enables reassembly of bins after the initial binning process to enhance the accuracy and quality of the final contigs. This feature is still in beta and can significantly increase runtime.
--disable-binning: Skip the vRhyme binning stage entirely. When enabled, the pipeline copies the per-sample filtered contigs directly into the dereplication and summary steps, which is useful when bins cannot be recovered for some samples.
-a <assembly_software>: (For ViOTUcluster_AllinOne only) Specifies the assembly software used during the raw sequence processing. Accepted values are -a megahit or -a metaspades.
--max-prediction-tasks, -P <N>: Caps the total number of concurrent prediction jobs (e.g., viralverify/virsorter2/genomad). Default: 30.
--tpm-tasks, -T <N>: Caps the number of samples processed concurrently in the BAM/TPM step. Default: 15.
--assemble-jobs, -A <N>: Caps the number of samples assembled concurrently. Default: 10.
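
To show how these parameters fit together, here is a minimal sketch of a wrapper for the contig-based mode with lowered concurrency caps. The function name `run_viotucluster`, the `DRY_RUN` and `THREADS` variables, and the cap values (4, 2, 2) are illustrative assumptions, not recommendations from the authors; only the flags themselves come from the parameter list above.

```shell
# Hypothetical wrapper around the contig-based mode.
# Arguments: $1 = contigs dir, $2 = FASTQ dir, $3 = output dir, $4 = database dir.
# With DRY_RUN=1 the command is printed instead of executed.
run_viotucluster() {
    # Rebuild the positional parameters as the full command line.
    set -- ViOTUcluster -i "$1" -r "$2" -o "$3" -d "$4" \
        -n "${THREADS:-8}" -m 2500 --non-con \
        --max-prediction-tasks 4 --tpm-tasks 2 --assemble-jobs 2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running `DRY_RUN=1 run_viotucluster input_contigs input_fastq output_path databases` prints the assembled command so it can be reviewed before a real run.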

File Structure Example

The tree below shows how the files should be organized, assuming the example files use the prefixes example1 and example2:

<project_directory>/
│
├── input_contigs/
│   ├── example1.fasta
│   ├── example2.fasta
│   └── ...
│
├── input_fastq/
│   ├── example1_R1.fq
│   ├── example1_R2.fq
│   ├── example2_R1.fq
│   ├── example2_R2.fq
│   └── ...
│
├── output_path/
│   ├── Summary/
│   │   ├── SeperateRes
│   │   │   ├── example1_viralseqs.fasta
│   │   │   ├── example2_viralseqs.fasta
│   │   │   └── ... 
│   │   ├── vOTU
│   │   │    ├── vOTU.fasta
│   │   │    ├── vOTU.Abundance.csv
│   │   │    ├── vOTU.Taxonomy.csv
│   │   │    └── CheckVRes
│   │   ├── DRAMRes(Optional)
│   │   │    ├── DRAM_annotations.tsv
│   │   │    └── DRAM_Gene.Abundance.csv
│   │   └── iPhopRes(Optional)
│   └── (IntermediateFile....)
│
└── databases/
    ├── db/                # VirSorter2 database
    ├── viralVerify/       # ViralVerify database
    ├── checkv-db-v1.5/    # CheckV database (version 1.5)
    ├── genomad_db/        # Genomad database
    └── Aug_2023_pub_rw/   # iPhop database

input_contigs/ contains the assembled contigs (e.g., example1.fasta).
input_fastq/ contains the corresponding FASTQ files (e.g., example1_R1.fq and example1_R2.fq).
output_path/ is the directory where all output files will be stored.
databases/ contains the required databases for the analysis, including:
- db/: The VirSorter2 database.
- viralVerify/: The ViralVerify database.
- checkv-db-v1.5/: The CheckV database (version 1.5).
- genomad_db/: The Genomad database.
- Aug_2023_pub_rw/: The iPhop database.
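
Since the pipeline relies on matching prefixes between contig and read files, a quick pre-flight check can catch naming mistakes early. The sketch below assumes that contig files end in `.fasta` (as in the examples) and that reads follow the `<prefix>_R1` / `<prefix>_R2` pattern with one of the accepted extensions; the function name `check_pairs` is hypothetical and not part of ViOTUcluster.

```shell
# Hypothetical helper: for every <prefix>.fasta in the contig directory ($1),
# look for <prefix>_R1 and <prefix>_R2 reads in the FASTQ directory ($2).
# Prints one line per missing mate and returns non-zero if any are missing.
check_pairs() {
    local contigs_dir=$1 fastq_dir=$2 status=0 fasta prefix mate ext found
    for fasta in "$contigs_dir"/*.fasta; do
        [ -e "$fasta" ] || continue   # the glob matched nothing
        prefix=$(basename "$fasta" .fasta)
        for mate in R1 R2; do
            found=0
            for ext in fq fq.gz fastq fastq.gz; do
                [ -f "$fastq_dir/${prefix}_${mate}.${ext}" ] && found=1
            done
            if [ "$found" -eq 0 ]; then
                echo "missing ${mate} reads for ${prefix}"
                status=1
            fi
        done
    done
    return "$status"
}
```

With the layout above, `check_pairs input_contigs input_fastq` prints nothing and succeeds when every contig file has both read files.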

Final Output

The processed data is organized under the specified output_path/, with the following structure:

output_path/Summary: Contains the final results and summaries for all processed samples, organized into the following subdirectories:
- SeperateRes: Holds the predicted viral sequences for each sample (e.g., example1, example2), one file per sample:
  - <sample>_viralseqs.fasta: The predicted viral contigs for the respective sample.
- vOTU/: Contains the final processed viral OTU (vOTU) results across all samples:
  - vOTU.fasta: The final dereplicated viral contigs after clustering from all samples.
  - vOTU.Abundance.csv: Abundance data of the vOTUs across samples.
  - vOTU.Taxonomy.csv: Taxonomic assignments for the vOTUs, if available.
  - CheckVRes: Summarized CheckV quality assessments for the final vOTU file.
- DRAMRes (Optional): Functional annotations from DRAM, produced if the advanced analysis stage is executed.
  - DRAM_annotations.tsv: Aggregated DRAM annotations for all predicted genes.
  - DRAM_Gene.Abundance.csv: TPM-based abundance estimates for each DRAM-predicted gene across samples.
- iPhopRes (Optional): Host prediction results from iPhop, produced if it is included in the workflow.
output_path/IntermediateFile: This directory holds intermediate files generated during the processing pipeline, such as filtered sequences and any temporary data.
databases/: Stores the necessary databases used for various stages of the analysis:
- db/: The VirSorter2 database.
- viralVerify/: The ViralVerify database, used for viral prediction.
- checkv-db-v1.5/: The CheckV database (version 1.5) for quality control of viral sequences.
- genomad_db/: The Genomad database for viral identification and dereplication.
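
After a run finishes, the layout described above can be checked with a few lines of shell. This is a minimal sketch under the assumption that the file names match the tree shown in the File Structure Example; the function name `summarize_votus` is hypothetical. It counts vOTUs by counting FASTA header lines, which start with `>`.

```shell
# Hypothetical helper: report which expected files exist under
# $1/Summary/vOTU and, if vOTU.fasta is present, count its sequences.
summarize_votus() {
    local votu_dir=$1/Summary/vOTU name status=0
    for name in vOTU.fasta vOTU.Abundance.csv vOTU.Taxonomy.csv; do
        if [ -f "$votu_dir/$name" ]; then
            echo "found $name"
        else
            echo "missing $name"
            status=1
        fi
    done
    if [ -f "$votu_dir/vOTU.fasta" ]; then
        # Each FASTA record begins with a '>' header line.
        echo "vOTU count: $(grep -c '^>' "$votu_dir/vOTU.fasta")"
    fi
    return "$status"
}
```

Calling `summarize_votus output_path` returns a non-zero status if any of the three files is absent.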

Acknowledgement

ViOTUcluster integrates state-of-the-art viromics analysis tools. The main tools within ViOTUcluster are listed below.

fastp: Online Publication

Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107.

MEGAHIT: Online Publication

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics

SPAdes: Online Publication

Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70, e102.

geNomad: Online Publication

Camargo, Antonio Pedro, Simon Roux, Frederik Schulz, Michal Babinski, Yan Xu, Bin Hu, Patrick SG Chain, Stephen Nayfach, and Nikos C. Kyrpides. "Identification of mobile genetic elements with geNomad." Nature Biotechnology (2023): 1-10.

viralVerify: Online Publication

Dmitry Antipov, Mikhail Raiko, Alla Lapidus, Pavel A Pevzner, MetaviralSPAdes: assembly of viruses from metagenomic data, Bioinformatics, Volume 36, Issue 14, July 2020, Pages 4126–4129

VirSorter2: Online Publication

Guo, Jiarong, Ben Bolduc, Ahmed A. Zayed, Arvind Varsani, Guillermo Dominguez-Huerta, Tom O. Delmont, Akbar Adjie Pratama et al. "VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses." Microbiome 9 (2021): 1-13.

PyHMMER: Online Publication

Martin Larralde, Georg Zeller, PyHMMER: a Python library binding to HMMER for efficient sequence analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad214

CheckV: Online Publication

Nayfach, S., Camargo, A.P., Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021)

vRhyme: Online Publication

Kieft, Kristopher, Alyssa Adams, Rauf Salamzade, Lindsay Kalan, and Karthik Anantharaman. "vRhyme enables binning of viral genomes from metagenomes." Nucleic Acids Research 50, no. 14 (2022): e83-e83.

dRep: Online Publication

Olm, M., Brown, C., Brooks, B. et al. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11, 2864–2868 (2017)

CheckM: Online Publication

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55

BWA: Online Publication

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.

Sambamba: Online Publication

Artem Tarasov, Albert J. Vilella, Edwin Cuppen, Isaac J. Nijman, Pjotr Prins, Sambamba: fast processing of NGS alignment formats, Bioinformatics, Volume 31, Issue 12, June 2015, Pages 2032–2034

DRAM: Online Publication

Michael Shaffer, Mikayla A Borton, Bridget B McGivern, Ahmed A Zayed, Sabina Leanti La Rosa, Lindsey M Solden, Pengfei Liu, Adrienne B Narrowe, Josué Rodríguez-Ramos, Benjamin Bolduc, M Consuelo Gazitúa, Rebecca A Daly, Garrett J Smith, Dean R Vik, Phil B Pope, Matthew B Sullivan, Simon Roux, Kelly C Wrighton, DRAM for distilling microbial metabolism to automate the curation of microbiome function, Nucleic Acids Research, Volume 48, Issue 16, 18 September 2020, Pages 8883–8900

iPHoP: Online Publication

Roux, Simon, Antonio Pedro Camargo, Felipe Hernandes Coutinho, Shareef M. Dabdoub, Bas E. Dutilh, Stephen Nayfach, and Andrew Tritt. "iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes." bioRxiv (2022): 2022-07.

Contact

Feel free to contact Sihang Liu (liusihang@tongji.edu.cn or GitHub Issues) with any questions or comments!

####################################################################################################
██╗   ██╗██╗ ██████╗ ████████╗██╗   ██╗ ██████╗██╗     ██╗   ██╗███████╗████████╗███████╗██████╗ 
██║   ██║██║██╔═══██╗╚══██╔══╝██║   ██║██╔════╝██║     ██║   ██║██╔════╝╚══██╔══╝██╔════╝██╔══██╗
██║   ██║██║██║   ██║   ██║   ██║   ██║██║     ██║     ██║   ██║███████╗   ██║   █████╗  ██████╔╝
╚██╗ ██╔╝██║██║   ██║   ██║   ██║   ██║██║     ██║     ██║   ██║╚════██║   ██║   ██╔══╝  ██╔══██╗
 ╚████╔╝ ██║╚██████╔╝   ██║   ╚██████╔╝╚██████╗███████╗╚██████╔╝███████║   ██║   ███████╗██║  ██║
  ╚═══╝  ╚═╝ ╚═════╝    ╚═╝    ╚═════╝  ╚═════╝╚══════╝ ╚═════╝ ╚══════╝   ╚═╝   ╚══════╝╚═╝  ╚═╝
####################################################################################################

Copyright

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License, version 2, as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.

Project details

Release history

- 0.5.7.1 (Dec 7, 2025), this version
- 0.5.7 (Nov 24, 2025)
- 0.5.6 (Nov 8, 2025)
- 0.5.5.1 (Sep 30, 2025)
- 0.5.5 (Sep 21, 2025)
- 0.5.4 (Sep 8, 2025)
- 0.5.3 (Jun 17, 2025)
- 0.5.2.2 (Jun 4, 2025)
- 0.5.2.1 (Apr 8, 2025)
- 0.5.2 (Apr 8, 2025)
- 0.5.1 (Mar 31, 2025)
- 0.4.7.1 (Mar 17, 2025)
- 0.4.7 (Mar 16, 2025)
- 0.4.6 (Mar 15, 2025)
- 0.4.5 (Mar 14, 2025)
- 0.4.4 (Mar 14, 2025), yanked
- 0.4.3 (Mar 14, 2025)

Download files

Source Distribution

viotucluster-0.5.7.1.tar.gz (75.5 kB)

Uploaded Dec 7, 2025 Source

Built Distribution

viotucluster-0.5.7.1-py3-none-any.whl (88.0 kB)

Uploaded Dec 7, 2025 Python 3

File details

Details for the file viotucluster-0.5.7.1.tar.gz.

File metadata

Download URL: viotucluster-0.5.7.1.tar.gz
Upload date: Dec 7, 2025
Size: 75.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for viotucluster-0.5.7.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `585f2a4b205c66150de5aa76b09b90915084737e2c0605801438d0cb47b1a17f` |
| MD5 | `6ff9224d52b858d9179ba27b7dff0f6c` |
| BLAKE2b-256 | `a4556c71bfc2aaac92f78bfaa8016f20d98d55d4d8d1ed36f562e884cbd668d6` |

File details

Details for the file viotucluster-0.5.7.1-py3-none-any.whl.

File metadata

Download URL: viotucluster-0.5.7.1-py3-none-any.whl
Upload date: Dec 7, 2025
Size: 88.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.18

File hashes

Hashes for viotucluster-0.5.7.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `eb9b51f495e467bd3c79631df8ed680071ae27d60cc93ff5108f9dcb0350c6c0` |
| MD5 | `a3b171ca49540e8bd1f1da200120fb44` |
| BLAKE2b-256 | `21ee038907828a8b34aa4dccb0b9e659dc9d1c5401bcc5a70d51fd6bce0d4f93` |
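
The SHA256 digests listed for each file can be compared against a manually downloaded copy. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is installed (as is typical on Linux); the function name `verify_sha256` is hypothetical.

```shell
# Hypothetical helper: succeed only if the SHA256 digest of file $1 equals $2.
verify_sha256() {
    local file=$1 expected=$2 actual
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}
```

For the wheel, this would be `verify_sha256 viotucluster-0.5.7.1-py3-none-any.whl eb9b51f495e467bd3c79631df8ed680071ae27d60cc93ff5108f9dcb0350c6c0`, which exits with status 0 only when the digests match.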
