MVP: Modular Viromics Pipeline for studying viruses identified from sequencing data
MVP v.1.1.5: Modular Viromics Pipeline
INTRODUCTION
MVP stands for Modular Viromics Pipeline. It is a simplified pipeline that uses a suite of state-of-the-art tools for studying viruses identified from sequencing data (and more). It is a quick and intuitive way to get a list of viral sequences and their properties that can be used for downstream analyses. The main contributors of each software tool used in MVP should be acknowledged (citations and links are provided).
PIPELINE OVERVIEW
The MVP pipeline takes (meta)genomic assemblies, reads, and a metadata file as inputs, and performs the following steps:
- Identify viruses, proviruses, and plasmids: geNomad
- Quality assessment and filtering: CheckV
- ANI clustering: provided with CheckV
- Abundance/coverage estimation: Bowtie2, minimap2, Samtools, and CoverM
- Functional annotation: Prodigal, MMseqs2, HMMER, DIAMOND
- Binning: vRhyme
- NCBI MIUViG preparation for submission
- Summarize outputs: R
INSTALLATION
Conda installation
conda create -n mvip -c conda-forge -c bioconda mvip
conda activate mvip
mvip -h
Installing the latest version of MVP from this repository
- First, open your terminal and clone the MVP repository to get the latest version of the scripts
git clone https://gitlab.com/ccoclet/mvp.git
- Next, create the corresponding conda environment and install the required packages
cd mvp
mamba (or conda) env create -f mvp_environment.yaml
- To activate this environment, use
mamba activate mvp
flit publish
Docker installation
You can install MVP using the Docker image from DockerHub. We thank @taavipall for providing the Docker image.
docker pull taavipall/mvp
Notes:
(1) We recommend using mamba rather than conda to install MVP for a quicker and more reliable installation process.
(2) If you work on macOS, you need to use mvp_mac_environment.yaml, as CoverM is currently not available for macOS systems.
mamba (or conda) env create -f mvp_mac_environment.yaml
mamba activate mvp
(3) If not already done, you might need to install flit.
(python3 -m pip install flit)
METADATA AND DIRECTORIES
To start using MVP, you will need input files and directories:
- a working directory (often named based on your project name) where MVP will create the output files,
- your sequencing files (.fna, .fasta, or .fa) in a directory readable by MVP (in the example below, we will use a directory called 00_ASSEMBLY_FILES),
- your read files (.fastq, .fastq.gz) in a directory readable by MVP, if you want to run Modules 04 and 05 (read mapping and abundance table). In the example below, we will use a directory called 00_READ_FILES,
- a metadata file (.txt) with 4 columns: Sample_number (sample group number), Sample (sample name), Assembly_Path (the absolute path of your input sequencing data files), Read_Path (the absolute path of your read files, .fastq or .fastq.gz) (see the example below for the content of the metadata).
| Sample_number | Sample | Assembly_Path | Read_Path | Variable |
|---|---|---|---|---|
| 1 | sample_name_1 | path/to/00_ASSEMBLY_FILES/sample_name_1_assembly.fna | path/to/00_READ_FILES/sample_name_1_read.fastq.gz | variable_1 |
| 2 | sample_name_2 | path/to/00_ASSEMBLY_FILES/sample_name_2_assembly.fna | path/to/00_READ_FILES/sample_name_2_read.fastq.gz | variable_1 |
| 3 | sample_name_3 | path/to/00_ASSEMBLY_FILES/sample_name_3_assembly.fasta | path/to/00_READ_FILES/sample_name_3_read_R1.fastq.gz | variable_2 |
| 3 | sample_name_4 | path/to/00_ASSEMBLY_FILES/sample_name_4_assembly.fasta | path/to/00_READ_FILES/sample_name_4_read_R1.fastq.gz | variable_2 |
| 3 | sample_name_5 | path/to/00_ASSEMBLY_FILES/sample_name_5_assembly.fasta | path/to/00_READ_FILES/sample_name_5_read_R1.fastq.gz | variable_2 |
| 4 | sample_name_6 | path/to/00_ASSEMBLY_FILES/sample_name_6_assembly.fa | path/to/00_READ_FILES/sample_name_6_read.fastq.gz | variable_2 |
| 5 | sample_name_7 | path/to/00_ASSEMBLY_FILES/sample_name_7_assembly.fa | path/to/00_READ_FILES/sample_name_7_read.fastq.gz | variable_3 |
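As a rough illustration of the layout above, the following Python sketch reads a metadata table and reports missing columns and empty cells. It assumes the file is tab-delimited (the separator is not stated here), and check_metadata is a hypothetical helper written for this example, not a function provided by MVP.

```python
import csv
import io

# The four columns described above.
REQUIRED_COLUMNS = ["Sample_number", "Sample", "Assembly_Path", "Read_Path"]


def check_metadata(handle):
    """Return a list of human-readable problems found in a metadata table."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-delimited
    header = reader.fieldnames or []
    problems = [
        f"missing column: {column}"
        for column in REQUIRED_COLUMNS
        if column not in header
    ]
    # Data rows start on line 2 (line 1 is the header).
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if column in header and not (row.get(column) or "").strip():
                problems.append(f"line {line_number}: empty {column}")
    return problems


example = io.StringIO(
    "Sample_number\tSample\tAssembly_Path\tRead_Path\n"
    "1\tsample_name_1\t/path/a.fna\t/path/a.fastq.gz\n"
    "2\tsample_name_2\t\t/path/b.fastq.gz\n"
)
print(check_metadata(example))
```

Running a check like this before Module 00 is only a convenience; Module 00 performs its own validation.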
Notes:
(1) If you run MVP for multiple projects and want to keep track of the input/output, we encourage you to save your metadata files and/or your sequencing and read files in directories such as 00_ASSEMBLY_FILES, and 00_READ_FILES in the working directory.
(2) If you have forward and reverse reads split into R1 and R2 files, you can just provide the forward R1 read file path in the column Read_Path of your metadata. If not already present, be sure to include R1 and R2 in the name of your forward R1 and reverse R2 read files. The script will find the path of the reverse R2 read file by itself if it is identical to the R1 file with just R1 changed to R2.
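The R1-to-R2 naming rule in note (2) can be sketched in Python as follows. This is an illustration only: guess_reverse_read_path is a hypothetical name, and swapping the last occurrence of "R1" in the file name is an assumption about how the substitution is applied, not MVP's actual code.

```python
from pathlib import PurePosixPath


def guess_reverse_read_path(forward_path):
    """Derive the R2 path by swapping the last 'R1' in the file name for 'R2'."""
    path = PurePosixPath(forward_path)
    head, marker, tail = path.name.rpartition("R1")
    if not marker:
        # No 'R1' in the file name: nothing to derive.
        return None
    return str(path.with_name(head + "R2" + tail))


print(guess_reverse_read_path("path/to/00_READ_FILES/sample_name_3_read_R1.fastq.gz"))
```

Only the file name is rewritten, so a directory that happens to contain "R1" in its name is left untouched.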
RUNNING MVP PIPELINE
mvip -h
Executing Module 00 (MVP Setup)
mvip MVP_00_set_up_MVP -h
mvip MVP_00_set_up_MVP -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to the working directory where MVP will be run.
- -m (required): Path to the metadata file that will be used to run MVP.
- --skip_install_databases: Use this argument if you have already installed the geNomad and CheckV databases and do not want to reinstall them (turned off by default).
- --genomad_db_path: Path to the directory where the geNomad database will be installed.
- --checkv_db_path: Path to the directory where the CheckV database will be installed.
- --skip_check_errors: Use this argument to skip the sequence data error checking (turned off by default).
Module and output explanations
This module will first check for any potential errors/issues in your metadata and your sequencing/read files:
- missing or extra columns, wrong column names,
- empty cells,
- wrong (i.e. unreadable) paths to your sequencing or read files,
- any potential sequence errors (any characters different than A, C, T, G, or N),
- duplicate headers.
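The last two checks in the list above (unexpected characters and duplicate headers) can be sketched in Python as follows. This is a minimal illustration, not MVP's implementation: check_fasta is a hypothetical helper, and treating lowercase bases as valid and comparing headers only up to the first whitespace are both assumptions.

```python
import io

ALLOWED_BASES = set("ACGTN")


def check_fasta(handle):
    """Return (duplicate_headers, headers_with_invalid_characters)."""
    seen, duplicates, invalid = set(), [], []
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""  # assumption: ID ends at first space
            if current in seen:
                duplicates.append(current)
            seen.add(current)
        elif set(line.upper()) - ALLOWED_BASES and current not in invalid:
            # Assumption: lowercase bases are accepted, anything else is flagged.
            invalid.append(current)
    return duplicates, invalid


fasta = io.StringIO(">c1\nACGT\n>c2\nACXT\n>c1\nNNNN\n")
print(check_fasta(fasta))
```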
Once your metadata has been checked, the module will create all the directories that MVP needs: 00_DATABASES, 01_GENOMAD, 02_CHECK_V, 03_CLUSTERING, 04_READ_MAPPING, 05_VOTU_TABLES and 06_FUNCTIONAL_ANNOTATION.
Finally, unless you use the --skip_install_databases argument, the module will install the geNomad and CheckV databases in 00_DATABASES, or in the paths provided by --genomad_db_path and/or --checkv_db_path, respectively. MVP will skip these steps if the databases already exist.
Notes:
(1) We highly encourage you to make sure that the geNomad and CheckV databases are available on your system before deciding whether to use --skip_install_databases. If you use the argument while the databases are not installed, you won't get any errors at this step; however, Module 01 won't work.
Executing Module 01 (Running geNomad and CheckV)
mvip MVP_01_run_genomad_checkv -h
mvip MVP_01_run_genomad_checkv -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --sample_group: Specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
- --skip_modify_headers: Use this argument if you do not want MVP to modify the name of each sequence by prefixing it with the sample name (default = false).
- --min_seq_size: Filter each input file based on a minimum sequence length (in bp, default = 0).
- --genomad_relaxed or --genomad_conservative: Relaxed and conservative post-classification filters. Use them to disable the post-classification filters or to be very conservative with your classification, respectively; the latter prevents sequences without strong support from being classified as plasmid or virus.
- --genomad_db_path and --checkv_db_path: Optional arguments providing the paths to the geNomad and CheckV databases, respectively, if they were not installed with the script 00_set_up_mvp.py and are not located in 01_GENOMAD and 02_CHECK_V. If you used 00_set_up_mvp.py, MVP will find the databases by itself and you don't need to specify anything.
- --force_genomad and --force_checkv: Force geNomad and/or CheckV execution even if the directories already exist (default = do not overwrite existing directories).
- --threads: Number of threads (default = 1).
Module and output explanations
This module will create a directory for each sample, run two rounds of geNomad and CheckV on each sample, and return the results in four directories inside the respective sample directory in 01_GENOMAD and 02_CHECK_V:
- <sample_name>_Viruses_Genomad_Output,
- <sample_name>_Viruses_CheckV_Output,
- <sample_name>_Proviruses_Genomad_Output,
- <sample_name>_Proviruses_CheckV_Output.
The <sample_name>_virus_summary.tsv and proviruses_virus_summary.tsv files, together with the quality_summary.tsv files for viruses and proviruses, summarize the results generated by the two rounds of geNomad and CheckV. These 4 files will be used to run Module 02.
The second round of geNomad and CheckV is used to properly process proviruses trimmed by CheckV: it computes a geNomad annotation and score only on the predicted provirus (instead of including the host contamination flagged by CheckV) and predicts completeness with CheckV on this trimmed provirus. If none of the contigs was trimmed by the first round of CheckV, the proviruses.fna file will be empty and empty files will be created for the second round of geNomad and CheckV.
Summary reports <sample_name>_MVP_01_Summary_Report.txt are also generated and stored in the respective <sample_name> directories in 01_GENOMAD.
Notes:
(1) This module will first create a new directory called 00_MODIFIED_ASSEMBLY_FILES to store the modified assembly files if the sequence headers are modified (i.e. you did not use --skip_modify_headers) or if you used the --min_seq_size argument.
(2) We highly encourage not using --skip_modify_headers if some of your contigs across the different FASTA files may have identical names, if your FASTA files come from various analyses, or if the headers of your sequences do not have a consistent format.
(3) MVP will skip the geNomad and/or CheckV steps if the directories already exist and are not empty, unless you use the --force_genomad and --force_checkv arguments.
(4) You can check the GitHub repositories of geNomad and CheckV for more explanations of the arguments and output files.
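To make the header modification and the minimum-length filter concrete, here is a small Python sketch. It is not MVP's code: prepare_records is a hypothetical helper, and the underscore used to join the sample name and the original header is an assumption (the exact separator is not documented here).

```python
def prepare_records(records, sample_name, min_seq_size=0, modify_headers=True):
    """Yield (header, sequence) pairs, length-filtered and optionally prefixed."""
    for header, sequence in records:
        if len(sequence) < min_seq_size:
            continue  # drop sequences shorter than the cutoff (in bp)
        if modify_headers:
            # Assumption: sample name and header are joined with '_'.
            header = f"{sample_name}_{header}"
        yield header, sequence


records = [("contig_1", "ACGTACGT"), ("contig_2", "ACG")]
print(list(prepare_records(records, "sample_name_1", min_seq_size=5)))
```

Prefixing with the sample name is what keeps contigs with identical names in different FASTA files distinguishable once the samples are merged in Module 03.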
Executing Module 02 (Filtering viral prediction)
mvip MVP_02_filter_genomad_checkv -h
mvip MVP_02_filter_genomad_checkv -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --sample_group: Specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
- --viral_min_genes: Minimum number of viral genes required to consider a virus prediction (based on CheckV annotation, default = 1).
- --host_viral_genes_ratio: Maximum ratio of host genes to viral genes required to consider a virus prediction (based on CheckV annotation, default = 1, i.e. no more host genes than viral genes).
Module and output explanations
This module will merge and filter the <sample_name>_virus_summary.tsv and proviruses_virus_summary.tsv files, together with the quality_summary.tsv files for viruses and proviruses, in the respective <sample_name>_Viruses_CheckV_Output. It will return a single tabular file for each sample called <sample_name>_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv. This tabular file lists all the viruses and proviruses that geNomad predicted and that passed the cutoffs, and gives you all geNomad and CheckV features (i.e. virus length, viral genes, completeness, taxonomy, etc.).
Then the module will create, for each sample, a concatenated FASTA file of the identified virus and trimmed provirus sequences called MVP_02_<sample_name>_viruses_proviruses_Sequences.fna.
Summary reports <sample_name>_MVP_02_Summary_Report.txt are also generated and stored in the respective <sample_name> directories in 02_CHECK_V.
Notes:
(1) You can choose to modify the --viral_min_genes and --host_viral_genes_ratio arguments at this step to filter your list of identified virus and provirus sequences based on the number of viral genes. However, we recommend keeping these arguments at their default values in order to keep a comprehensive list of potential viruses and proviruses at this step. You will be able to further filter your final tabular file using these arguments when you run Module 05.
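The two cutoffs can be read as a simple predicate over the viral and host gene counts that CheckV reports for each sequence. The Python sketch below is one plausible reading, not MVP's code: passes_gene_filters is a hypothetical name, and both the inclusive comparisons and the handling of sequences without viral genes are assumptions.

```python
def passes_gene_filters(viral_genes, host_genes,
                        viral_min_genes=1, host_viral_genes_ratio=1.0):
    """Return True if a prediction meets both gene-count cutoffs."""
    if viral_genes < viral_min_genes:
        return False
    if viral_genes == 0:
        # Assumption: with no viral genes, only sequences without host genes pass.
        return host_genes == 0
    return host_genes / viral_genes <= host_viral_genes_ratio


print(passes_gene_filters(viral_genes=3, host_genes=1))  # few host genes
print(passes_gene_filters(viral_genes=1, host_genes=2))  # more host than viral genes
```

With the default values, a sequence is kept when it has at least one viral gene and no more host genes than viral genes.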
Executing Module 03 (Clustering)
mvip MVP_03_do_clustering -h
mvip MVP_03_do_clustering -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --min_ani: Minimum ANI (Average Nucleotide Identity) value for clustering (default = 95).
- --min_tcov: Minimum coverage (Aligned Fraction, or AF) of the target sequence (default = 85).
- --min_qcov: Minimum coverage (Aligned Fraction, or AF) of the query sequence (default = 0).
- --read_type: Sequencing data type (e.g. short vs long reads) (default = short).
- --Unfiltered_protein_file: Create a protein FASTA file from the unfiltered virus sequences (default = False). Warning: if this argument is provided, the script might run for a long period of time.
- --threads: Number of threads (default = 1).
Module and output explanations
This module will first merge all the <sample_name>_viruses_proviruses.fna FASTA files and all <sample_name>_virus_summary.tsv generated by Module 02. It will return a single FASTA file MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences.fna and 2 tables MVP_03_All_Sample_Unfiltered_Merged_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv and MVP_03_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv in 03_CLUSTERING.
Then, the module will use these newly generated FASTA and table files to do a rapid genome clustering based on pairwise ANI, and return 1 FASTA file and 4 tabular files:
- MVP_03_All_Sample_Genomad_CheckV_Virus_Representative_Sequences.fna: a FASTA file containing all the representative sequences.
- MVP_03_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Quality_Summary.tsv: results of a greedy clustering, using by default the MIUViG-recommended parameters (95% ANI + 85% AF), along with all geNomad and CheckV information for each sequence/cluster.
- MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering.tsv: all-vs-all blastn of sequences.
- MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering_ANI.tsv: pairwise ANI between sequence pairs.
- MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering_ANI_Clusters.tsv: results of a greedy clustering, using by default the MIUViG-recommended parameters (95% ANI + 85% AF).
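To give an intuition for what a greedy clustering on pairwise ANI does, here is a toy Python sketch. It is a simplified illustration of a centroid-based greedy approach (longest sequence first), not the script shipped with CheckV or the code MVP runs: greedy_cluster and its input structures are hypothetical, and the query coverage cutoff is left out for brevity.

```python
def greedy_cluster(lengths, hits, min_ani=95.0, min_tcov=85.0):
    """Assign each sequence to a representative, longest sequences first.

    lengths: {sequence_id: length in bp}
    hits:    {(query_id, target_id): (ani, target_coverage)}, both in percent
    Returns {sequence_id: representative_id}.
    """
    representative_of = {}
    # Visit sequences from longest to shortest (ties broken by name).
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in representative_of:
            continue  # already absorbed by a longer representative
        representative_of[seq_id] = seq_id  # starts a new cluster
        for (query, target), (ani, tcov) in hits.items():
            if query != seq_id or target in representative_of:
                continue
            if ani >= min_ani and tcov >= min_tcov:
                representative_of[target] = seq_id
    return representative_of


lengths = {"a": 10000, "b": 8000, "c": 5000}
hits = {("a", "b"): (97.0, 90.0), ("a", "c"): (96.0, 60.0)}
print(greedy_cluster(lengths, hits))
```

In this toy example, b joins the cluster represented by a, while c stays on its own because only 60% of it is covered by the alignment.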
The module will generate a MVP_03_summary_report.txt file providing an overview of key statistics, quality assessments, and taxonomy distributions.
The module will build an index using bowtie2-build (short reads) or minimap2 (long reads) from your FASTA file containing the representative sequences in the 04_READ_MAPPING directory (in preparation of Module 04).
Finally, the module will create 2 FASTA files containing protein sequences for both representative viruses and all viruses, respectively and geNomad functional annotation tables in 06_FUNCTIONAL_ANNOTATION.
Notes:
(1) MVP_03_summary_report.txt will be completed with read mapping information in Module 05.
(2) bowtie2-build outputs a set of 6 files with suffixes <reference>.1.bt2, <reference>.2.bt2, <reference>.3.bt2, <reference>.4.bt2, <reference>.rev.1.bt2, and <reference>.rev.2.bt2 in 04_READ_MAPPING. These files together constitute the index: they are all that is needed to align reads to that reference in Module 04.
(3) If you work with long-read sequencing data (e.g. PacBio, Oxford Nanopore, Illumina Complete Long Reads), we recommend using the argument --read_type long. MVP will then use minimap2 to build the index (reference) needed in Module 04. It will produce a single file called reference.mmi in 04_READ_MAPPING.
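The two indexing paths described in notes (2) and (3) can be summarized in a short Python sketch that builds the command line and lists the index files to expect, without running anything. The helper index_command is hypothetical, and the commands show only the basic bowtie2-build and minimap2 -d usage; the exact options MVP passes to these tools are not documented here.

```python
BOWTIE2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def index_command(reference_fasta, index_prefix, read_type="short"):
    """Return (command, expected_index_files) for the chosen read type."""
    if read_type == "short":
        # Basic usage: bowtie2-build <reference FASTA> <index prefix>
        command = ["bowtie2-build", reference_fasta, index_prefix]
        return command, [index_prefix + suffix for suffix in BOWTIE2_SUFFIXES]
    if read_type == "long":
        # Basic usage: minimap2 -d <index file> <reference FASTA>
        index_file = index_prefix + ".mmi"
        return ["minimap2", "-d", index_file, reference_fasta], [index_file]
    raise ValueError(f"unknown read type: {read_type}")


command, files = index_command("representatives.fna", "04_READ_MAPPING/reference", "long")
print(" ".join(command))
print(files)
```

The file name representatives.fna is a placeholder for the representative-sequence FASTA file produced by this module.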
Executing Module 04 (Read mapping)
mvip MVP_04_do_read_mapping -h
mvip MVP_04_do_read_mapping -i WORKING_DIRECTORY/ -m data/example_metadata.txt --delete_files
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --sample_group: Specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
- --force_read_mapping: Force read mapping execution even if the directory already exists (default = do not overwrite existing directory).
- --read_type: Sequencing data type (e.g. short vs long reads) (default = short).
- --interleaved: Use this argument to run Bowtie2 on interleaved reads (default = TRUE).
- --delete_files: Use this argument if you want MVP to delete all intermediary files once a sample is done.
- --threads: Number of threads (default = 1).
Module and output explanations
This module will map reads from individual samples to the vOTU database generated in Module 03, using Bowtie2 (short reads) or minimap2 (long reads) to generate <sample_name>.sam, and Samtools to generate the <sample_name>.bam and <sample_name>_sorted.bam files. Then, the module will use CoverM to calculate coverage based on read mapping, using the BAM files sorted by reference, and return one tabular file per sample: <sample_name>MVP_04_CoverM.csv, containing the different coverage measures for each viral sequence. Finally, a summary report will be generated: <sample_name>MVP_04_Summary_Report.txt.
Notes:
(1) You can check the Github repositories of Bowtie2, minimap2, Samtools, and CoverM for more explanation on argument and output files.
Executing Module 05 (Creating vOTU tables)
mvip MVP_05_create_vOTU_table -h
mvip MVP_05_create_vOTU_table -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --covered_fraction: Minimum horizontal coverage fraction required to consider a coverage value in the abundance tables. By default, MVP will output abundance tables with minimum covered fractions of 0.1, 0.5, and 0.9.
- --normalization: CoverM coverage metric (RPKM or FPKM) to use for your abundance tables (default = RPKM).
- --filtration: Inclusion criteria to apply to the predicted viruses for your abundance tables. Can be relaxed or conservative (default = conservative). These two levels are pre-defined combinations of minimum completeness, minimum viral genes, minimum length, and/or maximum host viral genes ratio (see below for a detailed explanation of each). Alternatively, custom cutoff combinations can be applied using the individual arguments below.
- --viral_min_genes: Minimum number of viral genes required to include a virus prediction (based on CheckV annotation, default = 1).
- --host_viral_genes_ratio: Maximum ratio of host genes to viral genes required to include a virus prediction (default = 1).
Module and output explanations
This module will merge all the <sample_name>_CoverM.csv tabular files to create an unfiltered viral OTU table and save it as Unfiltered_RPKM_vOTU_table in 05_VOTU_TABLES. Then, the module will create a set of viral OTU tables based on the cutoffs (i.e., horizontal coverage) and filtration mode (i.e., conservative and relaxed) you choose. Finally, the module will complete the MVP_05_summary_report.txt generated in Module 03 with an overview of normalized abundance measures for vOTUs.
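The role of the horizontal coverage cutoff can be sketched in Python as follows. This is an illustration under stated assumptions rather than MVP's implementation: apply_covered_fraction is a hypothetical helper, and setting the abundance to zero when the covered fraction falls below the cutoff (rather than dropping the row) is an assumption.

```python
def apply_covered_fraction(rows, thresholds=(0.1, 0.5, 0.9)):
    """Build one abundance table per covered-fraction cutoff.

    rows: iterable of (votu, sample, covered_fraction, abundance)
    Returns {threshold: {(votu, sample): abundance}}.
    """
    rows = list(rows)
    tables = {}
    for threshold in thresholds:
        tables[threshold] = {
            # Assumption: abundances below the cutoff are reported as zero.
            (votu, sample): (abundance if covered >= threshold else 0.0)
            for votu, sample, covered, abundance in rows
        }
    return tables


rows = [("vOTU_1", "sample_name_1", 0.95, 12.5), ("vOTU_2", "sample_name_1", 0.30, 4.0)]
tables = apply_covered_fraction(rows)
print(tables[0.5])
```

A stricter cutoff trades sensitivity for confidence that a vOTU is really present in a sample, which is why several tables are produced by default.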
--filtration argument:
conservative will apply the following cutoffs: only include viral sequences predicted as ≥50% complete by CheckV (AAI prediction) or viral sequences ≥5kb.
relaxed will include all viral sequences that were gathered and clustered in Module 03.
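The two filtration levels described above amount to a simple rule, sketched here in Python. The function keep_sequence is hypothetical, 5kb is taken to mean 5,000 bp, and a missing completeness estimate is represented as None; these are assumptions made for the illustration.

```python
def keep_sequence(completeness, length, filtration="conservative"):
    """Decide whether a viral sequence is kept under the chosen filtration level."""
    if filtration == "relaxed":
        return True  # everything gathered and clustered in Module 03 is kept
    if filtration == "conservative":
        complete_enough = completeness is not None and completeness >= 50.0
        return complete_enough or length >= 5000
    raise ValueError(f"unknown filtration level: {filtration}")


print(keep_sequence(completeness=60.0, length=3000))  # kept: at least 50% complete
print(keep_sequence(completeness=None, length=3000))  # dropped: short, no estimate
```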
Notes:
(1) We strongly encourage running Module 05 with --filtration conservative to filter your vOTU tables. The conservative mode will only keep Complete, High- and Medium-quality, and ≥5kb Low-quality vOTUs, and is usually a good default for most viral ecogenomics studies.
(2) The last two arguments (--viral_min_genes and --host_viral_genes_ratio) can also be used when you run Module 02. They are provided in Module 05 to allow users to run only one round of clustering (Module 03) and read mapping (Module 04), and then test different cutoffs in Module 05 to evaluate their impact on the final results.
Executing Module 06 (Functional prediction)
mvip MVP_06_do_functional_annotation -h
mvip MVP_06_do_functional_annotation -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --fasta_files: Sequence and protein FASTA files (representative or all sequences) to use for functional annotation (default = representative).
- --PHROGS_evalue: Significance e-value of match between target sequences and query (default = 0.01).
- --PHROGS_score: Score of match between target sequences and query (default = 60).
- --PFAM_evalue: Significance e-value of match between target sequences and query (default = 0.01).
- --PFAM_score: Score of match between target sequences and query (default = 50).
- --ADS: Use this argument if you want MVP to search Anti-Defense Systems.
- --ADS_evalue: Significance e-value of match between target sequences and query (default = 0.01).
- --ADS_score: Score of match between target sequences and query (default = 60).
- --ADS_seqid: Sequence identity of match between target sequences and query (default = 30).
- --RdRP: Use this argument if you want MVP to create the 07_RDRP_PHYLOGENY folder and search RdRP profiles.
- --RdRP_evalue: Significance e-value of match between target sequences and query (default = 0.01).
- --RdRP_score: Score of match between target sequences and query (default = 50).
- --DRAM: Use this argument if you want MVP to create an input file to be processed through DRAM-v.
- --force_prodigal: Force execution of protein prediction by Prodigal.
- --force_PHROGS: Force PHROGS annotation.
- --force_PFAM: Force PFAM annotation.
- --force_ADS: Force ADS annotation.
- --force_RdRP: Force RdRP annotation.
- --force_outputs: Force creation of the final annotation table even if it already exists.
- --delete_files: Use this argument if you want MVP to delete all intermediary files once the functional annotation is done.
- --threads: Number of threads (default = 1).
Module and output explanations
This module will use one of the FASTA files (i.e. unfiltered, filtered relaxed contigs, or filtered relaxed vOTUs) containing protein sequences generated by Module 03 to search protein sequences against multiple databases (i.e. PHROGS, PFAM, Anti-defense systems, and/or RdRP) and return unfiltered annotation tables in 06_FUNCTIONAL_ANNOTATION, one for each annotation. Then, the module will filter all of these tables using score, e-value, and sequence identity thresholds and merge them into a single table with all functional annotations: MVP_06_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Gene_Annotation_GENOMAD_PHROGS_PFAM_ADS_RDRP_Filtered.tsv.
If you use the --RdRP argument, the module will create a new folder 06_RDRP_ANNOTATION, search the protein sequences against RdRP HMM profiles, and return 06C_Filtered_Formatted_RdRP_Profile_Tab.tsv, which can be used for downstream RdRP phylogeny analyses, for example.
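The threshold-based filtering of annotation hits can be pictured with the following Python sketch. It is only an illustration: filter_hits and the dictionary keys (evalue, score, seqid) are hypothetical, and treating the cutoffs as inclusive is an assumption. The default values mirror the PHROGS defaults listed in the flag explanations.

```python
def filter_hits(hits, max_evalue=0.01, min_score=60.0, min_seqid=None):
    """Keep hits that meet the e-value and score cutoffs (and identity, if given)."""
    kept = []
    for hit in hits:
        if hit["evalue"] > max_evalue or hit["score"] < min_score:
            continue
        if min_seqid is not None and hit["seqid"] < min_seqid:
            continue  # sequence identity is only checked when a cutoff is set
        kept.append(hit)
    return kept


hits = [
    {"protein": "p1", "evalue": 1e-20, "score": 150.0, "seqid": 45.0},
    {"protein": "p2", "evalue": 0.5, "score": 80.0, "seqid": 60.0},
    {"protein": "p3", "evalue": 1e-5, "score": 65.0, "seqid": 25.0},
]
print([hit["protein"] for hit in filter_hits(hits)])
print([hit["protein"] for hit in filter_hits(hits, min_seqid=30.0)])
```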
If you use --DRAM, the module will create a new folder 06_DRAM_V, and generate an input table MVP_06_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Gene_Annotation_GENOMAD_DRAM_Input.tsv that can be used to run DRAM-v.
Finally, the module will create MVP_06_summary_report.txt, which gives an overview of the functional annotation.
Notes:
(1) We encourage running Module 06 with --fasta_files representative (the default).
Executing Module 07 (Binning viral genomes)
mvip MVP_07_do_binning -h
mvip MVP_07_do_binning -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --binning_sample_group: Specify one number or a list of numbers in the metadata file you want to process by vRhyme for binning (by default, MVP will process all datasets listed in the metadata file one after the other).
- --read_mapping_sample_group: Specify one number or a list of numbers in the metadata file you want to process for the read mapping step of the vBins (by default, MVP will process all datasets listed in the metadata file one after the other).
- --keep_bam: If used, do not remove sorted BAM files generated by vRhyme. Otherwise all unsorted/index BAM files generated from vRhyme read alignment or SAM conversions will be deleted (default = TRUE).
- --force_vrhyme: Force vRhyme execution even if the directory 07A_vRHYME_OUTPUT already exists (default = do not overwrite existing directory).
- --force_checkv: Force CheckV execution even if the directory 07B_vBINS_CHECKV already exists (default = do not overwrite existing directory).
- --force_read_mapping: Force the read mapping step even if the directory 07C_vBINS_READ_MAPPING already exists (default = do not overwrite existing directory).
- --read_type: Sequencing data type (e.g. short vs long reads) (default = short).
- --interleaved: Use this argument to run Bowtie2 on interleaved reads (default = TRUE).
- --delete_files: Use this argument if you want MVP to delete all intermediary files once a sample is done.
- --force_outputs: Force the creation of output tables even if the tables already exist (default = do not overwrite existing files).
- --filtration: Relaxed or conservative vRhyme post-filtration to remove potentially contaminated vBins (default = conservative).
- --threads: Number of threads (default = 1).
Module and output explanations
This module will use the FASTA file containing viral sequences generated by the Module 03 to run vRhyme for binning virus genomes and return outputs in .
The module will generate a folder vRhyme_best_bins_fasta containing FASTA files for the best vBins, and Merged_vRhyme_Outputs_Unfiltered_best_vBins_Memberships_geNomad_CheckV_Summary_Table.tsv, which summarizes the vRhyme output tables, in 07A_vRHYME_OUTPUT.
Then, the module will run CheckV and do the read mapping on vBin sequences and return results in 07B_vBINS_CHECKV and 07C_vBINS_READ_MAPPING, respectively. This will generate two output files: Unfiltered_best_vBins_read_mapping_information_Table.tsv and Merged_vRhyme_Outputs_Unfiltered_best_vBins_Memberships_geNomad_CheckV_Summary_read_mapping_information_Table.tsv.
The module will then generate three filtered coverage tables in 07D_vBINS_vOTUS_TABLES, following the same pattern as Module 05 (i.e., horizontal coverage).
Next, the module will generate two FASTA files in 07E_FASTA_IPHOP_INPUTS: Filtered_filtration_Prokaryote_Unknown_best_vBins_Representative_Unbinned_vOTUs_Sequences_iPHoP_Input.fasta.tsv and Filtered_Prokaryote_Only_best_vBins_Representative_Unbinned_vOTUs_Sequences_iPHoP_Input.fasta. These files combine vBin and unbinned sequences that are predicted to be phages and unknown, or only phages, respectively.
Finally, the module will create MVP_07_Summary_Report.txt, which gives an overview of the binning results.
Notes:
(1) The coverage tables will combine filtered vBins and unbinned contigs.
(2) All filtration thresholds (checkV and horizontal coverage) are based on previous modules in order to get consistent outputs.
Executing Module 99 (Prepare NCBI MIUViG submission)
mvip MVP_99_prep_MIUViG_submission -h
mvip MVP_99_prep_MIUViG_submission -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- -g: Identifier of the sequence to be processed.
- -s: Should be one of "setup_metadata" (to be run first) or "prep_submission" (once sequence metadata have been checked and completed).
- -t: Path to the BioSample submission template file, generated from https://submit.ncbi.nlm.nih.gov/genbank/template/submission/. Only required for step 2 (prep_submission).
Module and output explanations
Executing Module 100 (Summarize outputs)
mvip MVP_100_summarize_outputs -h
mvip MVP_100_summarize_outputs -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i (required): Path to your working directory where all the MVP outputs will be generated and stored.
- -m (required): Your metadata file.
- --force: Force creation of summary reports and figures even if they already exist.
Module and output explanations
This module will first find the highest module used by the user, calculate the total MVP running time, and create a final summary report containing all the MVP commands used. Then, the module will find and store the main output tables in 100_SUMMARIZED_OUTPUTS. Finally, the module will execute the corresponding R code based on the highest MVP module run by the user. This script essentially automates the process of summarizing outputs from MVP, including generating a report, copying final output files, and integrating R-based data analysis and visualization.
Notes:
(1) This module creates a folder within 100_SUMMARIZED_OUTPUTS named with the current date and time. This setup enables users to run the same modules with different arguments or options, or to add new modules. They can then execute Module 100, and this won't delete previous summary folders.
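The idea of a date-and-time folder can be sketched in Python as follows. This is not MVP's code: summary_directory is a hypothetical helper, and the timestamp format shown is an assumption, since the exact naming pattern is not documented here.

```python
from datetime import datetime
from pathlib import Path


def summary_directory(working_directory, now=None):
    """Return a per-run output folder inside 100_SUMMARIZED_OUTPUTS."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")  # assumption: illustrative format
    return Path(working_directory) / "100_SUMMARIZED_OUTPUTS" / stamp


print(summary_directory("WORKING_DIRECTORY", datetime(2024, 1, 2, 3, 4, 5)))
```

Because each run gets its own folder name, rerunning Module 100 after changing arguments or adding modules leaves earlier summaries in place.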
LINKS TO GITHUB AND PEER-REVIEWED PAPERS
Link to geNomad github
Camargo, A.P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., Chain, P. S. G., Nayfach, S., & Kyrpides, N. C. You can move, but you can’t hide: identification of mobile genetic elements with geNomad. bioRxiv (2023), DOI: 10.1101/2020.11.01.361691
Link to CheckV github
Nayfach, S., Camargo, A.P, Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021). https://doi.org/10.1038/s41587-020-00774-7
Link to Bowtie2 github
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359
Link to minimap2
Li H., New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021, 37(23):4572–4574.
Link to Samtools github
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map Format and SAMtools. Bioinformatics 25, no. 16 2009: 2078–79.
Link to CoverM github
Link to MMseqs2 github
Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, doi: 10.1093/bioinformatics/btab184 (2021).
Link to HMMER github
Finn, R.D., Clements, J., & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic acids research, 39(suppl_2), W29-W37 (2011).
COPYRIGHT NOTICE
Modular Viromics Pipeline (MVP) Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.
LICENCE AGREEMENT
GPL v3 License
Modular Viromics Pipeline (MVP) Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.