MVP: Modular Viromics Pipeline for studying viruses identified from sequencing data

MVP v.1.0: Multi-choice Viromics Pipeline

QUICK LINKS

Overview
Databases and versions
Installation
Quick start
Main output files
References
Contributing
License / Copyright

OVERVIEW

MVP stands for Multi-choice Viromics Pipeline. It is a simplified pipeline that uses a suite of state-of-the-art tools for studying viruses identified from sequencing data (and more):

  1. geNomad: identify viruses, proviruses, and plasmids in sequencing data,
  2. CheckV: assess the quality and completeness of identified viral genomes,
  3. Clustering: custom code for rapid genome clustering based on pairwise ANI (also provided with CheckV),
  4. Bowtie2, minimap2, Samtools, and CoverM: calculate coverage of individual viral OTUs by read mapping,
  5. Generating coverage table: custom code to create a vOTU abundance table,
  6. MMseqs2 and HMMER: gene prediction and functional annotation.

It is a quick and intuitive way to get a list of viral sequences and their properties that can be used for downstream analyses. The main contributors of each tool used in MVP should be acknowledged (citations and links are provided at the end of this document).

[Figure: MVP workflow (MVP_Workflow.pdf)]

INSTALLATION

Conda installation

In development.

Installing the latest version of MVP from this repository

  • First, open your terminal and clone the MVP repository to get the latest version of the scripts
git clone https://gitlab.com/ccoclet/mvp.git
  • Next, create the corresponding conda environment and install the required packages
cd mvp
conda env create -f mvp_environment.yaml  # or: mamba env create -f mvp_environment.yaml
  • To activate this environment, use
conda activate mvp

Note: If you work on macOS, use mvp_mac_environment.yaml instead, as CoverM is not currently available for macOS:

conda env create -f mvp_mac_environment.yaml  # or: mamba env create -f mvp_mac_environment.yaml
conda activate mvp

METADATA AND DIRECTORIES

To start using MVP, you will need input files and directories:

  • a working directory (often named based on your project name) where MVP will create the output files,
  • your sequencing files (.fna, .fasta, or .fa) in a directory readable by MVP (in the example below, we will use a directory called 00_ASSEMBLY_FILES),
  • your read files (.fastq, .fastq.gz) in a directory readable by MVP, if you want to run Modules 04 and 05 (read mapping and abundance table). In the example below, we will use a directory called 00_READ_FILES,
  • a metadata file (.txt) with 4 columns: Sample_number (sample group number), Sample (sample name), Assembly_Path (the absolute path of your input sequencing data files), and Read_Path (the absolute path of your read files, .fastq or .fastq.gz). See the example below for the content of the metadata file; it also carries a fifth column, Variable.
Sample_number Sample Assembly_Path Read_Path Variable
1 sample_name_1 path/to/00_ASSEMBLY_FILES/sample_name_1_assembly.fna path/to/00_READ_FILES/sample_name_1_read.fastq.gz variable_1
2 sample_name_2 path/to/00_ASSEMBLY_FILES/sample_name_2_assembly.fna path/to/00_READ_FILES/sample_name_2_read.fastq.gz variable_1
3 sample_name_3 path/to/00_ASSEMBLY_FILES/sample_name_3_assembly.fasta path/to/00_READ_FILES/sample_name_3_read_R1.fastq.gz variable_2
3 sample_name_4 path/to/00_ASSEMBLY_FILES/sample_name_4_assembly.fasta path/to/00_READ_FILES/sample_name_4_read_R1.fastq.gz variable_2
3 sample_name_5 path/to/00_ASSEMBLY_FILES/sample_name_5_assembly.fasta path/to/00_READ_FILES/sample_name_5_read_R1.fastq.gz variable_2
4 sample_name_6 path/to/00_ASSEMBLY_FILES/sample_name_6_assembly.fa path/to/00_READ_FILES/sample_name_6_read.fastq.gz variable_2
5 sample_name_7 path/to/00_ASSEMBLY_FILES/sample_name_7_assembly.fa path/to/00_READ_FILES/sample_name_7_read.fastq.gz variable_3
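
Before running MVP, it can help to sanity-check the metadata file yourself. The sketch below is not MVP's own code: it assumes the file is tab-delimited, takes the four column names from the description above, and uses a hypothetical helper name, check_metadata. It only reports missing columns and empty cells.

```python
import csv
import io

# Column names come from the metadata description; the tab delimiter is an assumption.
REQUIRED_COLUMNS = ["Sample_number", "Sample", "Assembly_Path", "Read_Path"]


def check_metadata(handle):
    """Return a list of human-readable problems found in a metadata table."""
    problems = []
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    # Report every required column that is absent from the header line.
    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f"missing column: {column}")
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column in header:
            value = row.get(column)
            if value is None or not value.strip():
                problems.append(f"line {line_number}: empty cell in {column}")
    return problems


example = (
    "Sample_number\tSample\tAssembly_Path\tRead_Path\tVariable\n"
    "1\tsample_name_1\tpath/to/00_ASSEMBLY_FILES/sample_name_1_assembly.fna\t"
    "path/to/00_READ_FILES/sample_name_1_read.fastq.gz\tvariable_1\n"
)
print(check_metadata(io.StringIO(example)))  # prints []
```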

Notes:
(1) If you run MVP for multiple projects and want to keep track of the input/output, we encourage you to save your metadata files and/or your sequencing and read files in directories such as 00_ASSEMBLY_FILES and 00_READ_FILES inside the working directory.
(2) If you have forward and reverse reads split into R1 and R2 files, you can just provide the forward R1 read file path in the column Read_Path of your metadata. If not already present, be sure to include R1 and R2 in the name of your forward R1 and reverse R2 read files. The script will find the path of the reverse R2 read file by itself if it is identical to the R1 file with just R1 changed to R2.
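
The R1-to-R2 naming rule in note (2) can be illustrated with a few lines of Python. This is only a sketch of the rule as described, not MVP's implementation: the helper name infer_r2_path is hypothetical, and replacing the last occurrence of R1 in the file name (rather than the first) is an assumption.

```python
import os


def infer_r2_path(r1_path):
    """Guess the reverse-read path by swapping the last 'R1' in the file name for 'R2'."""
    directory, filename = os.path.split(r1_path)
    # Only the file name is edited, so an 'R1' in a directory name is left alone.
    head, sep, tail = filename.rpartition("R1")
    if not sep:
        raise ValueError(f"no 'R1' found in file name: {filename}")
    return os.path.join(directory, head + "R2" + tail)


print(infer_r2_path("path/to/00_READ_FILES/sample_name_3_read_R1.fastq.gz"))
```

If the two files do not differ only by R1/R2, a rule like this cannot find the reverse reads, which is why consistent naming matters.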

RUNNING THE PIPELINE - NORMAL MODE

Executing Module 00 (MVP Setup)

MVP_00_set_up_MVP -i WORKING_DIRECTORY/ -m data/example_metadata.txt --install_databases

Flag explanations

  • -i (required) Path to the working directory where MVP will be run.
  • -m (required) Path to the metadata file that will be used to run MVP.
  • --install_databases argument to use if you haven't already installed the geNomad and CheckV databases, or want to reinstall them (turned off by default).
  • --genomad_db_path Path to the directory where the geNomad database will be installed.
  • --checkv_db_path Path to the directory where the CheckV database will be installed.
  • --skip_check_errors argument to use if you want to skip the sequence data error checking (turned off by default).

Module and output explanations

This module will first check for any potential errors/issues in your metadata and your sequencing/read files:

  • missing or extra columns, wrong column names,
  • empty cells,
  • wrong (i.e. unreadable) paths to your sequencing or read files,
  • any potential sequence errors (any characters different than A, C, T, G, or N),
  • duplicate headers.
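
The two sequence-level checks in the list above (unexpected characters and duplicate headers) can be sketched as follows. This is an illustration under stated assumptions, not the code MVP runs: check_fasta is a hypothetical name, lowercase bases are treated as valid, and a header is identified by its first whitespace-delimited word.

```python
import io

ALLOWED = set("ACGTN")


def check_fasta(handle):
    """Return a list of problems: duplicate headers and non-ACGTN characters."""
    problems = []
    seen = set()
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name in seen:
                problems.append(f"line {line_number}: duplicate header {name}")
            seen.add(name)
        else:
            # Compare case-insensitively (assumption: lowercase bases are acceptable).
            bad = sorted(set(line.upper()) - ALLOWED)
            if bad:
                problems.append(f"line {line_number}: unexpected characters {''.join(bad)}")
    return problems


fasta = io.StringIO(">contig_1\nACGTNN\n>contig_2\nACGXT\n>contig_1\nACGT\n")
for problem in check_fasta(fasta):
    print(problem)
```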

Once your metadata has been checked, the module will create all the directories that MVP needs: 01_GENOMAD, 02_CHECK_V, 03_CLUSTERING, 04_READ_MAPPING, 05_VOTU_TABLES and 06_FUNCTIONAL_ANNOTATION.
Finally, if you use the --install_databases argument, the module will install the latest versions of the geNomad and CheckV databases in 01_GENOMAD and 02_CHECK_V, respectively, or in the paths provided by --genomad_db_path and/or --checkv_db_path. MVP will skip these steps if the databases already exist.

Notes:
(1) We highly encourage you to check whether the geNomad and CheckV databases are already available on your system before deciding whether to use --install_databases. If you omit the argument while the databases are not installed, this step will not raise any error, but Module 01 will not work.

Executing Module 01 (Running geNomad and CheckV)

MVP_01_run_genomad_checkv -h  
MVP_01_run_genomad_checkv -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --sample_group specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
  • --modify_headers if you want to modify the name of each sequence by prefixing with the sample name (default = false).
  • --min_seq_size if you want to filter each input file based on a minimum sequence length (in bp, default = 0).
  • --genomad_relaxed or --genomad_conservative use geNomad's relaxed or conservative post-classification filters: relaxed disables the post-classification filters, while conservative applies strict filters, preventing sequences without strong support from being classified as plasmid or virus.
  • --genomad_db_path and --checkv_db_path optional arguments giving the paths to the geNomad and CheckV databases, respectively, if they were not installed with Module 00 (MVP_00_set_up_MVP) and are not located in 01_GENOMAD and 02_CHECK_V. If you installed them with Module 00, MVP will find the databases by itself and you don't need to specify anything.
  • --force_genomad and --force_checkv arguments to force geNomad and/or CheckV execution even if the directories already exist (default = do not overwrite existing directories).
  • --threads set number of threads (default = 1).
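
To make the effect of --modify_headers and --min_seq_size concrete, here is a small sketch of the two transformations. It is not MVP's implementation: the function names are hypothetical, and the exact prefix format (sample name, underscore, original header) is an assumption.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def prepare_assembly(lines, sample_name, min_seq_size=0, modify_headers=False):
    """Drop sequences shorter than min_seq_size and optionally prefix headers."""
    records = []
    for header, sequence in parse_fasta(lines):
        if len(sequence) < min_seq_size:
            continue
        if modify_headers:
            # Assumed naming scheme: "<sample>_<original header>".
            header = f"{sample_name}_{header}"
        records.append((header, sequence))
    return records


lines = [">contig_1", "ACGT", "AC", ">contig_2", "AC"]
print(prepare_assembly(lines, "sample_name_1", min_seq_size=3, modify_headers=True))
# prints [('sample_name_1_contig_1', 'ACGTAC')]
```

Prefixing with the sample name is what keeps contigs with identical names in different assemblies distinguishable once the samples are merged in later modules.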

Module and output explanations

This module will create a directory for each sample, run two rounds of geNomad and CheckV on each sample, and return the results in four directories inside the respective sample directory in 01_GENOMAD and 02_CHECK_V:

  • <sample_name>_Viruses_Genomad_Output,
  • <sample_name>_Viruses_CheckV_Output,
  • <sample_name>_Proviruses_Genomad_Output,
  • <sample_name>_Proviruses_CheckV_Output.

The tabular files <sample_name>_virus_summary.tsv and proviruses_virus_summary.tsv, together with the two quality_summary.tsv files (one for viruses, one for proviruses), summarize the results of the two rounds of geNomad and CheckV. These 4 files will be used to run Module 02.

The second round of geNomad and CheckV is used to properly process proviruses trimmed by CheckV: geNomad annotation and scoring are computed only on the predicted provirus (instead of including the host contamination flagged by CheckV), and CheckV then predicts completeness on this trimmed provirus. If none of the contigs was trimmed by the first round of CheckV, the proviruses.fna file will be empty and empty files will be created for the second round of geNomad and CheckV.

Summary reports <sample_name>_MVP_01_Summary_Report.txt are also generated and stored in the respective <sample_name> directories in 01_GENOMAD.

Notes:
(1) This module will first create a new directory called 00_MODIFIED_ASSEMBLY_FILES if you used the --modify_headers or --min_seq_size arguments, to store the modified assembly files.
(2) We highly encourage using --modify_headers if some of your contigs across the different FASTA files may have identical names, if your FASTA files come from various analyses, or if the headers of your sequences do not have a consistent format.
(3) MVP will skip the geNomad and/or CheckV steps if the directories already exist and are not empty, unless you use the --force_genomad and/or --force_checkv arguments.
(4) You can check the GitHub repositories of geNomad and CheckV for more explanation of their arguments and output files.

Executing Module 02 (Filtering viral prediction)

MVP_02_filter_genomad_checkv -h
MVP_02_filter_genomad_checkv -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --sample_group specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
  • --viral_min_genes minimum number of viral genes required to consider a virus prediction (based on CheckV annotation, default = 1).
  • --host_viral_genes_ratio maximum ratio of host genes to viral genes required to consider a virus prediction (based on CheckV annotation, default = 1, i.e. no more host genes than viral genes).
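
The two cutoffs can be read as a simple predicate on the viral and host gene counts that CheckV reports (the viral_genes and host_genes columns of its quality_summary.tsv). The sketch below is one plausible reading of the flag descriptions, not MVP's own code; the function name passes_gene_cutoffs is hypothetical.

```python
def passes_gene_cutoffs(viral_genes, host_genes, viral_min_genes=1, host_viral_genes_ratio=1.0):
    """Return True if a prediction satisfies both gene-based cutoffs."""
    if viral_genes < viral_min_genes:
        return False
    # Written as a product so that viral_genes == 0 never causes a division by zero.
    return host_genes <= host_viral_genes_ratio * viral_genes


print(passes_gene_cutoffs(viral_genes=2, host_genes=2))  # prints True
print(passes_gene_cutoffs(viral_genes=2, host_genes=3))  # prints False
print(passes_gene_cutoffs(viral_genes=0, host_genes=0))  # prints False
```

With the defaults, a prediction is kept when it has at least one viral gene and no more host genes than viral genes.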

Module and output explanations

This module will merge and filter the <sample_name>_virus_summary.tsv, proviruses_virus_summary.tsv, and both quality_summary.tsv (viruses and proviruses) tabular files in the respective <sample_name>_Viruses_CheckV_Output. It will return a single tabular file for each sample called <sample_name>_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv. This tabular file lists all the viruses and proviruses that geNomad predicted and that passed the cutoffs, along with all geNomad and CheckV features (e.g. virus length, viral genes, completeness, taxonomy).

Then the module will create, for each sample, a concatenated FASTA file of the identified virus and trimmed provirus sequences called MVP_02_<sample_name>_viruses_proviruses_Sequences.fna.

Summary reports <sample_name>_MVP_02_Summary_Report.txt are also generated and stored in the respective <sample_name> directories in 02_CHECK_V.

Notes:
(1) You can modify the --viral_min_genes and --host_viral_genes_ratio arguments at this step to filter your list of identified virus and provirus sequences based on the number of viral genes. However, we recommend keeping these arguments at their default values in order to keep a comprehensive list of potential viruses and proviruses at this step. You will be able to further filter your final tabular file using these arguments when you run Module 05.

Executing Module 03 (Clustering)

MVP_03_do_clustering -h
MVP_03_do_clustering -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --min_ani minimum ANI (Average Nucleotide Identity) value for clustering (default = 95).
  • --min_tcov minimum coverage (Aligned Fraction, or AF) of the target sequence (default = 85).
  • --min_qcov minimum coverage (Aligned Fraction, or AF) of the query sequence (default = 0).
  • --read_type sequencing data type (e.g. short vs long reads) (default = short).
  • --Unfiltered_protein_file create a protein FASTA file from the unfiltered virus sequences (default = False). Warning: if this argument is provided, the script might run for a long time.
  • --threads set number of threads (default = 1).

Module and output explanations

This module will first merge all the <sample_name>_viruses_proviruses.fna FASTA files and all <sample_name>_virus_summary.tsv files generated by Module 02. It will return a single FASTA file, MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences.fna, and 2 tables, MVP_03_All_Sample_Unfiltered_Merged_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv and MVP_03_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv, in 03_CLUSTERING.

Then, the module will use these newly generated FASTA and table files to do a rapid genome clustering based on pairwise ANI, and return 1 FASTA file and 4 tabular files:

  • MVP_03_All_Sample_Genomad_CheckV_Virus_Representative_Sequences.fna: a FASTA file containing all the representative sequences.
  • MVP_03_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Quality_Summary.tsv: results of the greedy clustering, using by default the MIUViG-recommended parameters (95% ANI + 85% AF), along with all geNomad and CheckV information for each sequence/cluster.
  • MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering.tsv: all-vs-all blastn of sequences.
  • MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering_ANI.tsv: pairwise ANI between sequence pairs.
  • MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering_ANI_Clusters.tsv: results of the greedy clustering, using by default the MIUViG-recommended parameters (95% ANI + 85% AF).
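
The greedy, centroid-based clustering mentioned above can be sketched in a few lines. This is a simplified illustration in the spirit of the clustering script distributed with CheckV, not the code MVP runs: the function name, the input structures, and the tie-breaking by sequence name are assumptions.

```python
def greedy_cluster(lengths, hits, min_ani=95.0, min_tcov=85.0, min_qcov=0.0):
    """Greedy clustering: the longest unassigned sequence becomes a representative.

    lengths: dict mapping sequence id -> length in bp.
    hits: iterable of (query, target, ani, qcov, tcov) tuples, values in percent.
    Returns a dict mapping each representative to the list of its cluster members.
    """
    # Keep only the query -> target links that satisfy all three thresholds.
    neighbors = {seq_id: set() for seq_id in lengths}
    for query, target, ani, qcov, tcov in hits:
        if query != target and ani >= min_ani and qcov >= min_qcov and tcov >= min_tcov:
            neighbors[query].add(target)

    clusters = {}
    assigned = set()
    # Visit sequences from longest to shortest (ties broken by name for determinism).
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in assigned:
            continue
        members = [seq_id] + sorted(neighbors[seq_id] - assigned)
        clusters[seq_id] = members
        assigned.update(members)
    return clusters


lengths = {"a": 10000, "b": 9000, "c": 5000}
hits = [("a", "b", 97.0, 80.0, 90.0), ("a", "c", 90.0, 40.0, 95.0)]
print(greedy_cluster(lengths, hits))  # prints {'a': ['a', 'b'], 'c': ['c']}
```

In the example, b joins the cluster of a (97% ANI, 90% target coverage), while c stays alone because its ANI to a is below 95%.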

The module will generate a MVP_03_summary_report.txt file providing an overview of key statistics, quality assessments, and taxonomy distributions.

The module will build an index using bowtie2-build (short reads) or minimap2 (long reads) from your FASTA file containing the representative sequences in the 04_READ_MAPPING directory (in preparation of Module 04).

Finally, the module will create 2 FASTA files containing protein sequences for the representative viruses and for all viruses, respectively, as well as geNomad functional annotation tables, in 06_FUNCTIONAL_ANNOTATION.

Notes:
(1) MVP_03_summary_report.txt will be completed with read mapping information in Module 05.
(2) bowtie2-build outputs a set of 6 files with suffixes <reference>.1.bt2, <reference>.2.bt2, <reference>.3.bt2, <reference>.4.bt2, <reference>.rev.1.bt2, and <reference>.rev.2.bt2 in 04_READ_MAPPING. These files together constitute the index: they are all that is needed to align reads to that reference in Module 04.
(3) If you work with long-read sequencing data (e.g. PacBio, Oxford Nanopore, Illumina Complete Long Reads), we recommend using the argument --read_type long. MVP will then use minimap2 to build the index (reference) needed in Module 04, and will return a single file called reference.mmi in 04_READ_MAPPING.
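
Before launching Module 04 on short reads, you may want to confirm that the six Bowtie2 index files listed in note (2) are all present. The helper below is a hypothetical convenience, not part of MVP; the index prefix used in the demonstration ("reference") is an assumption.

```python
import tempfile
from pathlib import Path

# The six suffixes written by bowtie2-build for a standard (small) index.
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def missing_bowtie2_index_files(prefix):
    """Return the index files expected for this prefix that do not exist."""
    return [prefix + suffix for suffix in BT2_SUFFIXES if not Path(prefix + suffix).is_file()]


# Demonstration in a throwaway directory: create five of the six files.
with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "reference")
    for suffix in BT2_SUFFIXES[:-1]:
        Path(prefix + suffix).touch()
    print([Path(p).name for p in missing_bowtie2_index_files(prefix)])
    # prints ['reference.rev.2.bt2']
```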

Executing Module 04 (Read mapping)

MVP_04_do_read_mapping -h
MVP_04_do_read_mapping -i WORKING_DIRECTORY/ -m data/example_metadata.txt --delete_files

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --sample_group specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
  • --force_read_mapping argument to force read mapping execution even if the directory already exists (default = do not overwrite existing directory).
  • --read_type sequencing data type (e.g. short vs long reads) (default = short).
  • --interleaved argument to have Bowtie2 treat the reads as interleaved (default = TRUE).
  • --delete_files if you want MVP to delete all intermediary files once a sample is done.
  • --threads set number of threads (default = 1).

Module and output explanations

This module will map reads from individual samples to the vOTU database generated in Module 03: it will generate <sample_name>.sam using bowtie2 (short reads) or minimap2 (long reads), then <sample_name>.bam and <sample_name>_sorted.bam using Samtools. Then, the module will use CoverM to calculate coverage based on read mapping, using the BAM files sorted by reference, and return one tabular file per sample, <sample_name>MVP_04_CoverM.csv, containing the different coverage measures for each viral sequence. Finally, a summary report will be generated: <sample_name>MVP_04_Summary_Report.txt.

Notes:
(1) You can check the GitHub repositories of Bowtie2, minimap2, Samtools, and CoverM for more explanation of their arguments and output files.

Executing Module 05 (Creating vOTU tables)

MVP_05_create_vOTU_table -h
MVP_05_create_vOTU_table -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --covered_fraction minimum horizontal coverage fraction required to consider a coverage in the abundance tables. By default, MVP will output abundance tables with minimum covered fraction of 0.1, 0.5, and 0.9.
  • --normalization which CoverM coverage metrics (RPKM or FPKM) you want for your abundance tables (default = RPKM).
  • --filtration inclusion criteria you want to apply on the predicted viruses for your abundance tables. Can be relaxed or conservative (default = conservative). These two levels are pre-defined combinations of minimum completeness, minimum viral genes, minimum length, and/or maximum host viral genes ratio (see below for a detailed explanation of each). Alternatively, custom cutoffs combinations can be applied using individual arguments below.
  • --viral_min_genes minimum number of viral genes required to include a virus prediction (based on CheckV annotation, default = 1).
  • --viral_min_length minimum length required to include a virus prediction (default = 0).
  • --host_viral_genes_ratio maximum ratio of host genes to viral genes required to include a virus prediction (default = 1).

Module and output explanations

This module will merge all the <sample_name>_CoverM.csv tabular files to create an unfiltered viral OTU table and save it as Unfiltered_RPKM_vOTU_table in 05_VOTU_TABLES. Then, the module will create a set of viral OTU tables based on the cutoffs (i.e., horizontal coverage) and filtration mode (i.e., conservative and relaxed) you choose. Finally, the module will complete the MVP_05_summary_report.txt generated in Module 03 with an overview of normalized abundance measures for vOTUs.
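
The horizontal-coverage cutoff can be pictured as masking abundance values whose covered fraction falls below a threshold. The sketch below is an illustration only: the nested-dictionary layout and the function name are hypothetical, and setting masked values to 0 (rather than dropping them) is an assumption about how such a table could be filtered.

```python
def filter_by_covered_fraction(abundance, covered_fraction, min_fraction):
    """Zero out abundance values whose covered fraction is below min_fraction.

    Both inputs map vOTU id -> {sample name -> value}.
    """
    filtered = {}
    for votu, per_sample in abundance.items():
        filtered[votu] = {
            sample: (value if covered_fraction[votu][sample] >= min_fraction else 0.0)
            for sample, value in per_sample.items()
        }
    return filtered


abundance = {"vOTU_1": {"sample_1": 12.5, "sample_2": 3.0}}
covered = {"vOTU_1": {"sample_1": 0.95, "sample_2": 0.30}}
for cutoff in (0.1, 0.5, 0.9):
    print(cutoff, filter_by_covered_fraction(abundance, covered, cutoff))
```

At the 0.1 cutoff both values survive; at 0.5 and 0.9 only the sample_1 value does, because only 30% of the vOTU is covered in sample_2.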

--filtration argument:

  • conservative: only includes viral sequences predicted as ≥50% complete by CheckV (AAI prediction) or viral sequences ≥5 kb.
  • relaxed: includes all viral sequences that were gathered and clustered in Module 03.
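
The two filtration levels can be written as a short predicate. This is a sketch of the rule as stated above, not MVP's code: the function name is hypothetical, completeness is assumed to be a percentage (or None when CheckV gives no estimate), and length is in bp.

```python
def keep_sequence(completeness, length, filtration="conservative"):
    """Return True if a viral sequence passes the chosen filtration level."""
    if filtration == "relaxed":
        return True
    if filtration == "conservative":
        is_complete_enough = completeness is not None and completeness >= 50.0
        return is_complete_enough or length >= 5000
    raise ValueError(f"unknown filtration level: {filtration}")


print(keep_sequence(completeness=62.0, length=3200))   # prints True
print(keep_sequence(completeness=None, length=7400))   # prints True
print(keep_sequence(completeness=20.0, length=3200))   # prints False
print(keep_sequence(completeness=20.0, length=3200, filtration="relaxed"))  # prints True
```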

Notes:
(1) We strongly encourage running Module 05 with --filtration conservative to filter your vOTU tables. The conservative mode will only keep Complete, High- and Medium-quality, and ≥5 kb Low-quality vOTUs, and is usually a good default for most viral ecogenomics studies.
(2) The --viral_min_genes and --host_viral_genes_ratio arguments can also be used when you run Module 02. They are provided in Module 05 so that users can run only one round of clustering (Module 03) and read mapping (Module 04), and then test different cutoffs in Module 05 to evaluate their impact on the final results.

Executing Module 06 (Functional prediction)

MVP_06_do_functional_annotation -h
MVP_06_do_functional_annotation -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --fasta_files Sequence and protein FASTA files (representative or all sequences) to use for functional annotation (default = representative).
  • --PHROGS_evalue Significance e-value of match between target sequences and query (default = 0.01).
  • --PHROGS_score Score of match between target sequences and query (default = 60).
  • --PFAM_evalue Significance e-value of match between target sequences and query (default = 0.01).
  • --PFAM_score Score of match between target sequences and query (default = 50).
  • --ADS if you want MVP to search for Anti-Defense Systems.
  • --ADS_evalue Significance e-value of match between target sequences and query (default = 0.01).
  • --ADS_score Score of match between target sequences and query (default = 60).
  • --ADS_seqid Sequence identity of match between target sequences and query (default = 30).
  • --RdRP if you want MVP to create the 07_RDRP_PHYLOGENY folder and search RdRP profiles.
  • --RdRP_evalue Significance e-value of match between target sequences and query (default = 0.01).
  • --RdRP_score Score of match between target sequences and query (default = 50).
  • --DRAM if you want MVP to create an input file to be processed through DRAM-v.
  • --force_prodigal force execution of protein prediction by Prodigal.
  • --force_PHROGS force PHROGS annotation.
  • --force_PFAM force PFAM annotation.
  • --force_ADS force ADS annotation.
  • --force_RdRP force RdRP annotation.
  • --force_outputs force creation of the final annotation table even if it already exists.
  • --delete_files if you want MVP to delete all intermediary files once the functional annotation is done.
  • --threads set number of threads (default = 1).

Module and output explanations

This module will use one of the FASTA files (i.e. unfiltered, filtered relaxed contigs, or filtered relaxed vOTUs) containing protein sequences generated by Module 03 to search protein sequences against multiple databases (i.e. PHROGS, PFAM, Anti-defense systems, and/or RdRP) and return unfiltered annotation tables in 06_FUNCTIONAL_ANNOTATION, one for each annotation. Then, the module will filter all of these tables using score, e-value, and sequence identity thresholds and merge them into a single table with all functional annotations: MVP_06_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Gene_Annotation_GENOMAD_PHROGS_PFAM_ADS_RDRP_Filtered.tsv.
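
The threshold-based filtering of annotation hits can be sketched as follows. This is only an illustration of the idea, not MVP's code: the dictionary keys (evalue, score, seqid) and the function name are hypothetical, and the default values shown are the PHROGS defaults listed above.

```python
def filter_hits(hits, max_evalue=0.01, min_score=60.0, min_seqid=None):
    """Keep hits that satisfy the e-value, score, and (optional) identity cutoffs."""
    kept = []
    for hit in hits:
        if hit["evalue"] > max_evalue:
            continue
        if hit["score"] < min_score:
            continue
        # Sequence identity is only checked when a cutoff is given (e.g. for ADS).
        if min_seqid is not None and hit["seqid"] < min_seqid:
            continue
        kept.append(hit)
    return kept


hits = [
    {"protein": "p1", "evalue": 1e-20, "score": 150.0, "seqid": 45.0},
    {"protein": "p2", "evalue": 0.5, "score": 80.0, "seqid": 60.0},
    {"protein": "p3", "evalue": 1e-5, "score": 65.0, "seqid": 25.0},
]
print([h["protein"] for h in filter_hits(hits)])                  # prints ['p1', 'p3']
print([h["protein"] for h in filter_hits(hits, min_seqid=30.0)])  # prints ['p1']
```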

If you use the --RdRP argument, the module will create a new folder 06_RDRP_ANNOTATION, search the protein sequences against RdRP HMM profiles, and return 06C_Filtered_Formatted_RdRP_Profile_Tab.tsv, which can be used for downstream analyses such as RdRP phylogeny.

If you use --DRAM, the module will create a new folder 06_DRAM_V, and generate an input table MVP_06_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Gene_Annotation_GENOMAD_DRAM_Input.tsv that can be used to run DRAM-v.

Finally, the module will create a MVP_06_summary_report.txt with an overview of the functional annotation.

Notes:
(1) We encourage running Module 06 with --fasta_files representative (the default).

Executing Module 07 (Binning viral genomes)

MVP_07_do_binning -h
MVP_07_do_binning -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --binning_sample_group specify one number or a list of numbers in the metadata file you want to process by vRhyme for binning (by default, MVP will process all datasets listed in the metadata file one after the other).
  • --read_mapping_sample_group specify one number or a list of numbers in the metadata file you want to process for the read mapping step of the vBins (by default, MVP will process all datasets listed in the metadata file one after the other).
  • --keep_bam If used, do not remove sorted BAM files generated by vRhyme. Otherwise all unsorted/index BAM files generated from vRhyme read alignment or SAM conversions will be deleted (default = TRUE).
  • --force_vrhyme argument to force vRhyme execution even if the directory 07A_vRHYME_OUTPUT already exists (default = do not overwrite existing directory).
  • --force_checkv argument to force CheckV execution even if the directory 07B_vBINS_CHECKV already exists (default = do not overwrite existing directory).
  • --force_read_mapping argument to force the read mapping step even if the directory 07C_vBINS_READ_MAPPING already exists (default = do not overwrite existing directory).
  • --read_type sequencing data type (e.g. short vs long reads) (default = short).
  • --interleaved argument to have Bowtie2 treat the reads as interleaved (default = TRUE).
  • --delete_files if you want MVP to delete all intermediary files once a sample is done.
  • --force_outputs argument to force the creation of output tables even if the tables already exist (default = do not overwrite existing files).
  • --filtration relaxed or conservative vRhyme post-filtration to remove potential contaminated vBins (default = conservative).
  • --threads set number of threads (default = 1).

Module and output explanations

This module will use the FASTA file containing viral sequences generated by Module 03 to run vRhyme for binning virus genomes. In 07A_vRHYME_OUTPUT, the module will generate a folder vRhyme_best_bins_fasta containing FASTA files for the best vBins, and Merged_vRhyme_Outputs_Unfiltered_best_vBins_Memberships_geNomad_CheckV_Summary_Table.tsv, which summarizes the vRhyme output tables.
Then, the module will run CheckV and do the read mapping on vBin sequences and return results in 07B_vBINS_CHECKV and 07C_vBINS_READ_MAPPING, respectively. This will generate two output files: Unfiltered_best_vBins_read_mapping_information_Table.tsv and Merged_vRhyme_Outputs_Unfiltered_best_vBins_Memberships_geNomad_CheckV_Summary_read_mapping_information_Table.tsv.
The module will then generate three filtered coverage tables following the same pattern as Module 05 (i.e., horizontal coverage) in 07D_vBINS_vOTUS_TABLES.
Next, the module will generate two FASTA files in 07E_FASTA_IPHOP_INPUTS: Filtered_filtration_Prokaryote_Unknown_best_vBins_Representative_Unbinned_vOTUs_Sequences_iPHoP_Input.fasta.tsv and Filtered_Prokaryote_Only_best_vBins_Representative_Unbinned_vOTUs_Sequences_iPHoP_Input.fasta, which combine vBin and unbinned sequences predicted to be phages and unknown, or only phages, respectively.
Finally, the module will create a MVP_07_Summary_Report.txt with an overview of the binning results.

Notes:
(1) The coverage tables will combine filtered vBins and unbinned contigs.
(2) All filtration thresholds (checkV and horizontal coverage) are based on previous modules in order to get consistent outputs.

Executing Module 99 (Prepare NCBI MIUViG submission)

MVP_99_prep_MIUViG_submission -h
MVP_99_prep_MIUViG_submission -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • -g Identifier of the sequence to be processed.
  • -s Should be one of "setup_metadata" (to be run first) or "prep_submission" (once sequence metadata have been checked and completed).
  • -t path to the BioSample submission template file, generated from https://submit.ncbi.nlm.nih.gov/genbank/template/submission/, only required for the step 2: prep_submission

Module and output explanations

Executing Module 100 (Summarize outputs)

MVP_100_summarize_outputs -h
MVP_100_summarize_outputs -i WORKING_DIRECTORY/ -m data/example_metadata.txt

Flag explanations

  • -i (required) path to your working directory where all the MVP outputs will be generated and stored.
  • -m (required) your metadata file.
  • --force force creation of summary reports and figures even if they already exist.

Module and output explanations

This module will first find the highest module run by the user, calculate the total MVP running time, and create a final summary report containing all the MVP commands used. Then, the module will find and store the main output tables in 100_SUMMARIZED_OUTPUTS. Finally, the module will execute the corresponding R code based on the highest MVP module run by the user. In short, this module automates the summarizing of MVP outputs, including generating reports, copying final output files, and integrating R-based data analysis and visualization.

Notes:
(1) This module creates a folder within 100_SUMMARIZED_OUTPUTS named with the current date and time. This setup enables users to run the same modules with different arguments or options, or to add new modules. They can then execute Module 100, and this won't delete previous summary folders.
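
The date-and-time folder naming described in note (1) is what keeps successive summaries from overwriting each other. A minimal sketch of the idea follows; it is not MVP's code, the helper name is hypothetical, and the timestamp format is an assumption.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_summary_dir(working_dir, now=None):
    """Create and return a new timestamped folder inside 100_SUMMARIZED_OUTPUTS."""
    now = now or datetime.now()
    # Hypothetical timestamp format; any format with second resolution works the same way.
    target = Path(working_dir) / "100_SUMMARIZED_OUTPUTS" / now.strftime("%Y-%m-%d_%H-%M-%S")
    # exist_ok=False makes an accidental overwrite fail loudly instead of silently.
    target.mkdir(parents=True, exist_ok=False)
    return target


with tempfile.TemporaryDirectory() as tmp:
    first = make_summary_dir(tmp, datetime(2024, 1, 2, 3, 4, 5))
    second = make_summary_dir(tmp, datetime(2024, 1, 2, 3, 9, 30))
    print(first.name, second.name)  # prints 2024-01-02_03-04-05 2024-01-02_03-09-30
```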

LINKS TO GITHUB AND PEER-REVIEWED PAPERS

Link to geNomad github
Camargo, A.P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., Chain, P. S. G., Nayfach, S., & Kyrpides, N. C. You can move, but you can’t hide: identification of mobile genetic elements with geNomad. bioRxiv (2023), DOI: 10.1101/2020.11.01.361691

Link to CheckV github
Nayfach, S., Camargo, A.P, Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021). https://doi.org/10.1038/s41587-020-00774-7

Link to Bowtie2 github
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359

Link to minimap2
Li H., New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021, 37(23):4572–4574.

Link to Samtools github
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map Format and SAMtools. Bioinformatics 25, no. 16 2009: 2078–79.

Link to CoverM github

Link to MMseqs2 github
Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, doi: 10.1093/bioinformatics/btab184 (2021).

Link to HMMER github
Finn, R.D., Clements, J., & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic acids research, 39(suppl_2), W29-W37 (2011).

COPYRIGHT NOTICE

Multi-choice Viromics Pipeline (MVP) Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.

LICENSE AGREEMENT

GPL v3 License

Multi-choice Viromics Pipeline (MVP) Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
