MVP: Modular Viromics Pipeline for studying viruses identified from sequencing data
MVP v.1.0: Multi-choice Viromics Pipeline
QUICK LINKS
Overview
Databases and versions
Installation
Quick start
Main output files
References
Contributing
License / Copyright
OVERVIEW
MVP stands for Multi-choice Viromics Pipeline. It is a simplified pipeline that uses a suite of state-of-the-art tools for studying viruses identified from sequencing data (and more):
- geNomad: identify viruses, proviruses, and plasmids in sequencing data,
- CheckV: assess the quality and completeness of identified viral genomes,
- Clustering: custom code for rapid genome clustering based on pairwise ANI (also provided with CheckV),
- Bowtie2, minimap2, Samtools, and CoverM: calculate coverage of individual viral OTUs by read mapping,
- Generating coverage table: custom code to create a vOTU abundance table,
- MMseqs2 and HMMER: gene prediction and functional annotation.
It is a quick and intuitive way to get a list of viral sequences and their properties that can be used for downstream analyses. The main contributors of each software tool used in MVP should be acknowledged (citations and links are provided at the end of this document).
INSTALLATION
Conda installation
In development.
Installing the latest version of MVP from this repository
- First, open your terminal and clone the MVP repository to get the latest version of the scripts:
git clone https://gitlab.com/ccoclet/mvp.git
- Next, create the corresponding conda environment and install the required packages (mamba can be used in place of conda):
cd mvp
conda env create -f mvp_environment.yaml
- To activate this environment, use:
conda activate mvp
Note: If you work on macOS, you need to use mvp_mac_environment.yaml instead, as CoverM is not currently available for macOS systems:
conda env create -f mvp_mac_environment.yaml
conda activate mvp
METADATA AND DIRECTORIES
To start using MVP, you will need input files and directories:
- a working directory (often named based on your project name) where MVP will create the output files,
- your sequencing files (.fna, .fasta, or .fa) in a directory readable by MVP (in the example below, we will use a directory called 00_ASSEMBLY_FILES),
- your read files (.fastq, .fastq.gz) in a directory readable by MVP, if you want to run Modules 04 and 05 (read mapping and abundance table). In the example below, we will use a directory called 00_READ_FILES,
- a metadata file (.txt) with 4 columns: Sample_number (sample group number), Sample (sample name), Assembly_Path (the absolute path of your input sequencing data files), and Read_Path (the absolute path of your read files, .fastq or .fastq.gz). See the example below for the content of the metadata file.
Sample_number | Sample | Assembly_Path | Read_Path | Variable |
---|---|---|---|---|
1 | sample_name_1 | path/to/00_ASSEMBLY_FILES/sample_name_1_assembly.fna | path/to/00_READ_FILES/sample_name_1_read.fastq.gz | variable_1 |
2 | sample_name_2 | path/to/00_ASSEMBLY_FILES/sample_name_2_assembly.fna | path/to/00_READ_FILES/sample_name_2_read.fastq.gz | variable_1 |
3 | sample_name_3 | path/to/00_ASSEMBLY_FILES/sample_name_3_assembly.fasta | path/to/00_READ_FILES/sample_name_3_read_R1.fastq.gz | variable_2 |
3 | sample_name_4 | path/to/00_ASSEMBLY_FILES/sample_name_4_assembly.fasta | path/to/00_READ_FILES/sample_name_4_read_R1.fastq.gz | variable_2 |
3 | sample_name_5 | path/to/00_ASSEMBLY_FILES/sample_name_5_assembly.fasta | path/to/00_READ_FILES/sample_name_5_read_R1.fastq.gz | variable_2 |
4 | sample_name_6 | path/to/00_ASSEMBLY_FILES/sample_name_6_assembly.fa | path/to/00_READ_FILES/sample_name_6_read.fastq.gz | variable_2 |
5 | sample_name_7 | path/to/00_ASSEMBLY_FILES/sample_name_7_assembly.fa | path/to/00_READ_FILES/sample_name_7_read.fastq.gz | variable_3 |
Notes:
(1) If you run MVP for multiple projects and want to keep track of the input/output, we encourage you to save your metadata files and/or your sequencing and read files in directories such as 00_ASSEMBLY_FILES and 00_READ_FILES in the working directory.
(2) If you have forward and reverse reads split into R1 and R2 files, you only need to provide the forward R1 read file path in the Read_Path column of your metadata. If not already present, be sure to include R1 and R2 in the names of your forward and reverse read files, respectively. The script will find the reverse R2 read file by itself if its path is identical to the R1 path with just R1 changed to R2.
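The R1-to-R2 lookup described in note (2) can be pictured with a small sketch. This is only an illustration under stated assumptions, not MVP's actual code: the function name guess_reverse_read_path is hypothetical, and it assumes the last occurrence of R1 in the file name is the one that marks the read direction.

```python
from pathlib import PurePosixPath


def guess_reverse_read_path(forward_path: str) -> str:
    """Swap the last 'R1' in the file name for 'R2' (hypothetical helper)."""
    path = PurePosixPath(forward_path)
    # rpartition splits on the LAST 'R1', so an 'R1' earlier in the name is left alone.
    head, tag, tail = path.name.rpartition("R1")
    if not tag:
        raise ValueError(f"no 'R1' tag in file name: {path.name}")
    return str(path.with_name(head + "R2" + tail))


print(guess_reverse_read_path("path/to/00_READ_FILES/sample_name_3_read_R1.fastq.gz"))
# prints: path/to/00_READ_FILES/sample_name_3_read_R2.fastq.gz
```

Only the file name is rewritten, so a directory that happens to contain R1 in its name is not affected.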
RUNNING THE PIPELINE - NORMAL MODE
Executing Module 00 (MVP Setup)
MVP_00_set_up_MVP -i WORKING_DIRECTORY/ -m data/example_metadata.txt --install_databases
Flag explanations
- -i: (required) path to the working directory where MVP will be run.
- -m: (required) path to the metadata file that will be used to run MVP.
- --install_databases: use this argument if you haven't already installed the geNomad and CheckV databases, or want to reinstall them (turned off by default).
- --genomad_db_path: path to the directory where the geNomad database will be installed.
- --checkv_db_path: path to the directory where the CheckV database will be installed.
- --skip_check_errors: use this argument if you want to skip the sequence data error checking (turned off by default).
Module and output explanations
This module will first check for any potential errors/issues in your metadata and your sequencing/read files:
- missing or extra columns, wrong column names,
- empty cells,
- wrong (i.e. unreadable) paths to your sequencing or read files,
- any potential sequence errors (any characters different than A, C, T, G, or N),
- duplicate headers.
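To make the last two checks concrete, here is a minimal sketch of how sequence errors and duplicate headers could be detected in a FASTA file. It is not MVP's implementation: the function name check_fasta is hypothetical, lowercase bases are assumed to be acceptable, and the header is assumed to be the first whitespace-delimited token after ">".

```python
import io
from collections import Counter

VALID_BASES = set("ACGTN")


def check_fasta(handle):
    """Return (duplicate_headers, headers_with_invalid_characters) for a FASTA stream."""
    header_counts = Counter()
    invalid = set()
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            header_counts[current] += 1
        elif current is not None and set(line.upper()) - VALID_BASES:
            # Assumption: case is ignored; anything other than A, C, G, T, or N is an error.
            invalid.add(current)
    duplicates = sorted(h for h, n in header_counts.items() if n > 1)
    return duplicates, sorted(invalid)


example = ">contig_1\nACGTN\n>contig_2\nACGX\n>contig_1\nGGCC\n"
print(check_fasta(io.StringIO(example)))
# prints: (['contig_1'], ['contig_2'])
```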
Once your metadata has been checked, the module will create all the directories that MVP needs: 01_GENOMAD, 02_CHECK_V, 03_CLUSTERING, 04_READ_MAPPING, 05_VOTU_TABLES and 06_FUNCTIONAL_ANNOTATION.
Finally, if you use the --install_databases argument, the module will install the latest versions of the geNomad and CheckV databases in 01_GENOMAD and 02_CHECK_V, respectively, or in the paths provided by --genomad_db_path and/or --checkv_db_path. MVP will skip these steps if the databases already exist.
Notes:
(1) We highly encourage you to check whether the geNomad and CheckV databases are available on your system before deciding whether to use --install_databases. If you do not use the argument and the databases are not installed, you won't get any errors at this step, but Module 01 won't work.
Executing Module 01 (Running geNomad and CheckV)
MVP_01_run_genomad_checkv -h
MVP_01_run_genomad_checkv -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --sample_group: specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
- --modify_headers: use if you want to modify the name of each sequence by prefixing it with the sample name (default = false).
- --min_seq_size: use if you want to filter each input file based on a minimum sequence length (in bp, default = 0).
- --genomad_relaxed or --genomad_conservative: relaxed and conservative post-classification filters, used respectively to disable the post-classification filters or to be very conservative with your classification, preventing sequences without strong support from being classified as plasmid or virus.
- --genomad_db_path and --checkv_db_path: optional arguments used to provide the paths to the geNomad and CheckV databases, respectively, if they were not installed with the script 00_set_up_mvp.py and are not located in 01_GENOMAD and 02_CHECK_V. If you used 00_set_up_mvp.py, MVP will find the databases by itself and you don't need to specify anything.
- --force_genomad and --force_checkv: arguments to force geNomad and/or CheckV execution even if the directories already exist (default = do not overwrite existing directories).
- --threads: set number of threads (default = 1).
Module and output explanations
This module will create a directory for each sample, run two rounds of geNomad and CheckV on each sample, and return the results in four directories inside the respective sample directory in 01_GENOMAD and 02_CHECK_V:
- <sample_name>_Viruses_Genomad_Output,
- <sample_name>_Viruses_CheckV_Output,
- <sample_name>_Proviruses_Genomad_Output,
- <sample_name>_Proviruses_CheckV_Output.
The <sample_name>_virus_summary.tsv, proviruses_virus_summary.tsv, and both quality_summary.tsv (for viruses and proviruses) tabular files summarize the results generated by the two rounds of geNomad and CheckV. These 4 files will be used to run Module 02.
The second round of geNomad and CheckV is used to properly process proviruses trimmed by CheckV, by computing a geNomad annotation and score only on the predicted provirus (instead of including the host contamination flagged by CheckV) and predicting completeness with CheckV on this trimmed provirus. If none of the contigs was trimmed by the first round of CheckV, the proviruses.fna file will be empty and empty files will be created for the second round of geNomad and CheckV.
Summary reports <sample_name>_MVP_01_Summary_Report.txt are also generated and stored in the respective <sample_name> directories in 01_GENOMAD.
Notes:
(1) This module will first create a new directory called 00_MODIFIED_ASSEMBLY_FILES to store the modified assembly files if you used the --modify_headers or --min_seq_size arguments.
(2) We highly encourage using --modify_headers if some of your contigs across the different FASTA files may have identical names, if your FASTA files come from various analyses, or if the headers of your sequences do not have a consistent format.
(3) MVP will skip the geNomad and/or CheckV steps if the directories already exist and are not empty, unless you use the --force_genomad and --force_checkv arguments.
(4) You can check the GitHub repositories of geNomad and CheckV for more explanation of arguments and output files.
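As an illustration of what --modify_headers and --min_seq_size do to an assembly file, the sketch below prefixes each sequence name with the sample name and drops sequences shorter than a cutoff. The function name prefix_and_filter and the underscore separator between the sample name and the original header are assumptions made for this example; MVP's exact header format may differ.

```python
def prefix_and_filter(records, sample_name, min_seq_size=0):
    """Yield (header, sequence) pairs, renamed and length-filtered (hypothetical helper)."""
    for header, sequence in records:
        if len(sequence) >= min_seq_size:
            # Assumed naming scheme: <sample_name>_<original header>.
            yield f"{sample_name}_{header}", sequence


records = [("contig_1", "ACGTACGT"), ("contig_2", "ACG")]
print(list(prefix_and_filter(records, "sample_name_1", min_seq_size=5)))
# prints: [('sample_name_1_contig_1', 'ACGTACGT')]
```

Prefixing with the sample name is what keeps headers unique once sequences from several samples are merged in later modules.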
Executing Module 02 (Filtering viral prediction)
MVP_02_filter_genomad_checkv -h
MVP_02_filter_genomad_checkv -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --sample_group: specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
- --viral_min_genes: minimum number of viral genes required to consider a virus prediction (based on CheckV annotation, default = 1).
- --host_viral_genes_ratio: maximum ratio of host genes to viral genes required to consider a virus prediction (based on CheckV annotation, default = 1, i.e. no more host genes than viral genes).
Module and output explanations
This module will merge and filter the <sample_name>_virus_summary.tsv, proviruses_virus_summary.tsv, and both quality_summary.tsv (for viruses and proviruses) tabular files in the respective <sample_name>_Viruses_CheckV_Output. It will return a single tabular file for each sample called <sample_name>_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv. This tabular file lists all the viruses and proviruses that geNomad predicted and that passed the cutoffs, and gives you all geNomad and CheckV features (e.g. virus length, viral genes, completeness, taxonomy).
Then, the module will create, for each sample, a concatenated FASTA file of the identified virus and trimmed provirus sequences called MVP_02_<sample_name>_viruses_proviruses_Sequences.fna.
Summary reports <sample_name>_MVP_02_Summary_Report.txt are also generated and stored in the respective <sample_name> directories in 02_CHECK_V.
Notes:
(1) You can modify the --viral_min_genes and --host_viral_genes_ratio arguments at this step to filter your list of identified virus and provirus sequences based on the number of viral genes. However, we recommend keeping these arguments at their default values in order to keep a comprehensive list of potential viruses and proviruses at this step. You will be able to further filter your final tabular file using these arguments when you run Module 05.
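The two cutoffs can be read as a simple per-sequence test. The sketch below is an interpretation, not MVP's code: the function name passes_gene_filters is hypothetical, the column names viral_genes and host_genes are taken from CheckV's quality_summary.tsv (MVP's merged table may name them differently), and the handling of sequences with zero viral genes is an assumption.

```python
def passes_gene_filters(row, viral_min_genes=1, host_viral_genes_ratio=1.0):
    """Return True if a prediction passes both gene-based cutoffs (hypothetical helper)."""
    viral = int(row["viral_genes"])
    host = int(row["host_genes"])
    if viral < viral_min_genes:
        return False
    # Written as a product to avoid dividing by zero when viral == 0 (assumed behaviour).
    return host <= host_viral_genes_ratio * viral


rows = [
    {"contig_id": "contig_1", "viral_genes": "3", "host_genes": "1"},
    {"contig_id": "contig_2", "viral_genes": "2", "host_genes": "5"},
    {"contig_id": "contig_3", "viral_genes": "0", "host_genes": "0"},
]
print([r["contig_id"] for r in rows if passes_gene_filters(r)])
# prints: ['contig_1']
```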
Executing Module 03 (Clustering)
MVP_03_do_clustering -h
MVP_03_do_clustering -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --min_ani: minimum ANI (Average Nucleotide Identity) value for clustering (default = 95).
- --min_tcov: minimum coverage (Aligned Fraction, or AF) of the target sequence (default = 85).
- --min_qcov: minimum coverage (Aligned Fraction, or AF) of the query sequence (default = 0).
- --read_type: sequencing data type (e.g. short vs long reads) (default = short).
- --Unfiltered_protein_file: create a protein FASTA file from the unfiltered virus sequences (default = False). Warning: if this argument is provided, the script might run for a long time.
- --threads: set number of threads (default = 1).
Module and output explanations
This module will first merge all the <sample_name>_viruses_proviruses.fna FASTA files and all <sample_name>_virus_summary.tsv generated by Module 02. It will return a single FASTA file, MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences.fna, and 2 tables, MVP_03_All_Sample_Unfiltered_Merged_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv and MVP_03_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Virus_Proviruses_Quality_Summary.tsv, in 03_CLUSTERING.
Then, the module will use these newly generated FASTA and table files to do a rapid genome clustering based on pairwise ANI, and return 1 FASTA file and 4 tabular files:
- MVP_03_All_Sample_Genomad_CheckV_Virus_Representative_Sequences.fna: a FASTA file containing all the representative sequences.
- MVP_03_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Quality_Summary.tsv: Results of a greedy clustering, using by default the MIUVIG recommended-parameters (95% ANI + 85% AF), along with all geNomad and CheckV information for each sequence/cluster.
- MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering.tsv: all-vs-all blastn of sequences.
- MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering_ANI.tsv: pairwise ANI between sequence pairs.
- MVP_03_All_Sample_Filtered_Relaxed_Virus_Provirus_Sequences_Clustering_ANI_Clusters.tsv: Results of a greedy clustering, using by default the MIUVIG recommended-parameters (95% ANI + 85% AF).
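The greedy clustering step can be sketched as follows, in the spirit of the centroid-based approach distributed with CheckV: sequences are visited from longest to shortest, each unassigned sequence becomes a cluster representative, and every unassigned sequence linked to it by a pair passing the ANI and coverage cutoffs joins its cluster. This is a simplified illustration rather than the script MVP runs; the function name greedy_cluster and the input layout are assumptions.

```python
def greedy_cluster(lengths, ani_pairs, min_ani=95.0, min_tcov=85.0, min_qcov=0.0):
    """Greedy centroid clustering (hypothetical helper).

    lengths:   {sequence_id: length in bp}
    ani_pairs: iterable of (query, target, ani, qcov, tcov), all in percent
    Returns {representative_id: [member ids, representative first]}.
    """
    # neighbours[q] = sequences that would join a cluster seeded by q.
    neighbours = {seq_id: set() for seq_id in lengths}
    for query, target, ani, qcov, tcov in ani_pairs:
        if query != target and ani >= min_ani and tcov >= min_tcov and qcov >= min_qcov:
            neighbours[query].add(target)

    clusters = {}
    assigned = set()
    # Longest sequences are considered first; ties are broken by name for reproducibility.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in assigned:
            continue
        members = [seq_id] + sorted(neighbours[seq_id] - assigned)
        clusters[seq_id] = members
        assigned.update(members)
    return clusters


lengths = {"vA": 40000, "vB": 38000, "vC": 12000}
pairs = [
    ("vA", "vB", 97.2, 90.0, 95.0),
    ("vB", "vA", 97.2, 95.0, 90.0),
    ("vA", "vC", 80.0, 10.0, 30.0),
]
print(greedy_cluster(lengths, pairs))
# prints: {'vA': ['vA', 'vB'], 'vC': ['vC']}
```

With the default cutoffs, vB joins the cluster of the longer vA, while vC stays on its own because its ANI to vA is below 95.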
The module will generate a MVP_03_summary_report.txt file providing an overview of key statistics, quality assessments, and taxonomy distributions.
The module will build an index using bowtie2-build (short reads) or minimap2 (long reads) from your FASTA file containing the representative sequences in the 04_READ_MAPPING directory (in preparation of Module 04).
Finally, the module will create 2 FASTA files containing protein sequences for both representative viruses and all viruses, respectively and geNomad functional annotation tables in 06_FUNCTIONAL_ANNOTATION.
Notes:
(1) MVP_03_summary_report.txt will be completed with read mapping information in Module 05.
(2) bowtie2-build outputs a set of 6 files with suffixes <reference>.1.bt2, <reference>.2.bt2, <reference>.3.bt2, <reference>.4.bt2, <reference>.rev.1.bt2, and <reference>.rev.2.bt2 in 04_READ_MAPPING. These files together constitute the index: they are all that is needed to align reads to that reference in Module 04.
(3) If you work with long-read sequencing data (e.g. PacBio, Oxford Nanopore, Illumina Complete Long Reads), we recommend using the argument --read_type long. MVP will then use minimap2 to build the index (reference) needed in Module 04, and will return a single file called reference.mmi in 04_READ_MAPPING.
Executing Module 04 (Read mapping)
MVP_04_do_read_mapping -h
MVP_04_do_read_mapping -i WORKING_DIRECTORY/ -m data/example_metadata.txt --delete_files
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --sample_group: specify one number or a list of numbers in the metadata file you want to process (by default, MVP will process all datasets listed in the metadata file one after the other).
- --force_read_mapping: argument to force read mapping execution even if the directory already exists (default = do not overwrite existing directory).
- --read_type: sequencing data type (e.g. short vs long reads) (default = short).
- --interleaved: argument to run Bowtie2 on interleaved reads (default = TRUE).
- --delete_files: use if you want MVP to delete all intermediary files once a sample is done.
- --threads: set number of threads (default = 1).
Module and output explanations
This module will map reads from individual samples to the vOTU database generated in Module 03, using Bowtie2 (short reads) or minimap2 (long reads) to generate <sample_name>.sam, and then Samtools to generate the <sample_name>.bam and <sample_name>_sorted.bam files. Then, the module will use CoverM to calculate coverage based on read mapping, using the BAM files sorted by reference, and return one tabular file per sample, <sample_name>MVP_04_CoverM.csv, containing the different coverage measures for each viral sequence. Finally, a summary report will be generated: <sample_name>MVP_04_Summary_Report.txt.
Notes:
(1) You can check the GitHub repositories of Bowtie2, minimap2, Samtools, and CoverM for more explanation of arguments and output files.
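For readers who want to picture the short-read path of this module, the sketch below assembles the kind of Bowtie2 and Samtools commands that produce the .sam, .bam, and _sorted.bam files named above. It only builds and prints the commands; it is an approximation under stated assumptions, and the options MVP actually passes may differ. The function name short_read_mapping_commands and the index prefix 04_READ_MAPPING/reference are hypothetical.

```python
import shlex


def short_read_mapping_commands(index_prefix, r1, r2, sample, threads=1):
    """Return the three commands for paired-end short reads (hypothetical helper)."""
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}_sorted.bam"
    return [
        # Align paired reads against the Bowtie2 index and write SAM.
        ["bowtie2", "-x", index_prefix, "-1", r1, "-2", r2, "-S", sam, "-p", str(threads)],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort the BAM file by reference coordinate.
        ["samtools", "sort", "-o", sorted_bam, bam],
    ]


commands = short_read_mapping_commands(
    "04_READ_MAPPING/reference",  # assumed index prefix
    "sample_name_3_read_R1.fastq.gz",
    "sample_name_3_read_R2.fastq.gz",
    "sample_name_3",
    threads=4,
)
for command in commands:
    print(shlex.join(command))
```

Each command could then be executed with subprocess.run(command, check=True) in an environment where Bowtie2 and Samtools are installed.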
Executing Module 05 (Creating vOTU tables)
MVP_05_create_vOTU_table -h
MVP_05_create_vOTU_table -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --covered_fraction: minimum horizontal coverage fraction required to consider a coverage value in the abundance tables. By default, MVP will output abundance tables with minimum covered fractions of 0.1, 0.5, and 0.9.
- --normalization: which CoverM coverage metric (RPKM or FPKM) you want for your abundance tables (default = RPKM).
- --filtration: inclusion criteria you want to apply to the predicted viruses for your abundance tables. Can be relaxed or conservative (default = conservative). These two levels are pre-defined combinations of minimum completeness, minimum viral genes, minimum length, and/or maximum host viral genes ratio (see below for a detailed explanation of each). Alternatively, custom cutoff combinations can be applied using the individual arguments below.
- --viral_min_genes: minimum number of viral genes required to include a virus prediction (based on CheckV annotation, default = 1).
- --viral_min_length: minimum length required to include a virus prediction (default = 0).
- --host_viral_genes_ratio: maximum ratio of host genes to viral genes required to include a virus prediction (default = 1).
Module and output explanations
This module will merge all the <sample_name>_CoverM.csv tabular files to create an unfiltered viral OTU table and save it as Unfiltered_RPKM_vOTU_table in 05_VOTU_TABLES. Then, the module will create a set of viral OTU tables based on the cutoffs (i.e., horizontal coverage) and filtration mode (i.e., conservative and relaxed) you choose. Finally, the module will complete the MVP_05_summary_report.txt generated in Module 03 with an overview of normalized abundance measures for vOTUs.
--filtration argument:
- conservative: will apply the following cutoffs: only include viral sequences predicted as ≥50% complete by CheckV (AAI prediction) or viral sequences ≥5kb.
- relaxed: will include all viral sequences that were gathered and clustered in Module 03.
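The two filtration modes and the covered-fraction cutoff can be expressed as small predicates. This sketch is an interpretation of the description above, not MVP's code: the function names keep_votu and mask_low_coverage are hypothetical, a missing completeness estimate is represented as None, and setting the abundance to zero below the covered-fraction cutoff is an assumption about how such a cutoff would typically be applied.

```python
def keep_votu(completeness, length, filtration="conservative"):
    """Decide whether a vOTU is included in the abundance tables (hypothetical helper)."""
    if filtration == "relaxed":
        return True
    if filtration != "conservative":
        raise ValueError(f"unknown filtration mode: {filtration}")
    # conservative: at least 50% complete according to CheckV, or at least 5 kb long.
    return (completeness is not None and completeness >= 50.0) or length >= 5000


def mask_low_coverage(abundance, covered_fraction, min_covered_fraction=0.5):
    """Assumed behaviour: report zero abundance when too little of the sequence is covered."""
    return abundance if covered_fraction >= min_covered_fraction else 0.0


print(keep_votu(60.0, 3000))             # prints: True
print(keep_votu(None, 3000))             # prints: False
print(keep_votu(None, 3000, "relaxed"))  # prints: True
print(mask_low_coverage(12.5, 0.3))      # prints: 0.0
```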
Notes:
(1) We strongly encourage running Module 05 with --filtration conservative to filter your vOTU tables. The conservative mode will only keep Complete, High- and Medium-quality, and ≥5kb Low-quality vOTUs, and is usually a good default for most viral ecogenomics studies.
(2) The last two arguments (--viral_min_genes and --host_viral_genes_ratio) can also be used when you run Module 02. They are provided in Module 05 to allow users to run only one round of clustering (Module 03) and read mapping (Module 04), and then test different cutoffs in Module 05 to evaluate their impact on the final results.
Executing Module 06 (Functional prediction)
MVP_06_do_functional_annotation -h
MVP_06_do_functional_annotation -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --fasta_files: sequence and protein FASTA files (representative or all sequences) to use for functional annotation (default = representative).
- --PHROGS_evalue: significance e-value of match between target sequences and query (default = 0.01).
- --PHROGS_score: score of match between target sequences and query (default = 60).
- --PFAM_evalue: significance e-value of match between target sequences and query (default = 0.01).
- --PFAM_score: score of match between target sequences and query (default = 50).
- --ADS: use if you want MVP to search Anti-Defense Systems.
- --ADS_evalue: significance e-value of match between target sequences and query (default = 0.01).
- --ADS_score: score of match between target sequences and query (default = 60).
- --ADS_seqid: sequence identity of match between target sequences and query (default = 30).
- --RdRP: use if you want MVP to create the 07_RDRP_PHYLOGENY folder and search RdRP profiles.
- --RdRP_evalue: significance e-value of match between target sequences and query (default = 0.01).
- --RdRP_score: score of match between target sequences and query (default = 50).
- --DRAM: use if you want MVP to create an input file to be processed through DRAM-v.
- --force_prodigal: force execution of protein prediction by Prodigal.
- --force_PHROGS: force PHROGS annotation.
- --force_PFAM: force PFAM annotation.
- --force_ADS: force ADS annotation.
- --force_RdRP: force RdRP annotation.
- --force_outputs: force creation of the final annotation table even if it already exists.
- --delete_files: use if you want MVP to delete all intermediary files once the functional annotation is done.
- --threads: set number of threads (default = 1).
Module and output explanations
This module will use one of the FASTA files (i.e. unfiltered, filtered relaxed contigs, or filtered relaxed vOTUs) containing protein sequences generated by Module 03 to search protein sequences against multiple databases (i.e. PHROGS, PFAM, Anti-defense systems, and/or RdRP) and return unfiltered annotation tables in 06_FUNCTIONAL_ANNOTATION, one for each annotation. Then, the module will filter all of these tables using score, e-value, and sequence identity thresholds and merge them into a single table with all functional annotations: MVP_06_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Gene_Annotation_GENOMAD_PHROGS_PFAM_ADS_RDRP_Filtered.tsv.
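The threshold filtering applied to each annotation table can be pictured as below. This is a minimal sketch and not MVP's implementation: the function name filter_hits and the field names evalue and score are assumptions, and the defaults mirror the PHROGS thresholds listed in the flag explanations.

```python
def filter_hits(hits, max_evalue=0.01, min_score=60.0):
    """Keep hits whose e-value and score both pass the thresholds (hypothetical helper)."""
    return [
        hit for hit in hits
        if hit["evalue"] <= max_evalue and hit["score"] >= min_score
    ]


hits = [
    {"protein": "contig_1_1", "annotation": "terminase large subunit", "evalue": 1e-30, "score": 210.0},
    {"protein": "contig_1_2", "annotation": "tail fiber protein", "evalue": 0.5, "score": 75.0},
    {"protein": "contig_1_3", "annotation": "portal protein", "evalue": 1e-5, "score": 42.0},
]
print([hit["protein"] for hit in filter_hits(hits)])
# prints: ['contig_1_1']
```

A sequence identity cutoff, as used for the Anti-Defense Systems search, would be one more condition of the same form.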
If you use the --RdRP argument, the module will create a new folder 06_RDRP_ANNOTATION, search the protein sequences against RdRP HMM profiles, and return 06C_Filtered_Formatted_RdRP_Profile_Tab.tsv, which can be used for downstream RdRP phylogeny analyses, for example.
If you use --DRAM, the module will create a new folder 06_DRAM_V, and generate an input table MVP_06_All_Sample_Filtered_Relaxed_Merged_Genomad_CheckV_Representative_Virus_Proviruses_Gene_Annotation_GENOMAD_DRAM_Input.tsv that can be used to run DRAM-v.
Finally, the module will create an MVP_06_summary_report.txt with an overview of the functional annotation.
Notes:
(1) We encourage running Module 06 with --fasta_files representative (the default).
Executing Module 07 (Binning viral genomes)
MVP_07_do_binning -h
MVP_07_do_binning -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --binning_sample_group: specify one number or a list of numbers in the metadata file you want to process with vRhyme for binning (by default, MVP will process all datasets listed in the metadata file one after the other).
- --read_mapping_sample_group: specify one number or a list of numbers in the metadata file you want to process for the read mapping step of the vBins (by default, MVP will process all datasets listed in the metadata file one after the other).
- --keep_bam: if used, do not remove sorted BAM files generated by vRhyme. Otherwise all unsorted/index BAM files generated from vRhyme read alignment or SAM conversions will be deleted (default = TRUE).
- --force_vrhyme: argument to force vRhyme execution even if the directory 07A_vRHYME_OUTPUT already exists (default = do not overwrite existing directory).
- --force_checkv: argument to force CheckV execution even if the directory 07B_vBINS_CHECKV already exists (default = do not overwrite existing directory).
- --force_read_mapping: argument to force the read mapping step even if the directory 07C_vBINS_READ_MAPPING already exists (default = do not overwrite existing directory).
- --read_type: sequencing data type (e.g. short vs long reads) (default = short).
- --interleaved: argument to run Bowtie2 on interleaved reads (default = TRUE).
- --delete_files: use if you want MVP to delete all intermediary files once a sample is done.
- --force_outputs: argument to force the creation of output tables even if the tables already exist (default = do not overwrite existing files).
- --filtration: relaxed or conservative vRhyme post-filtration to remove potentially contaminated vBins (default = conservative).
- --threads: set number of threads (default = 1).
Module and output explanations
This module will use the FASTA file containing viral sequences generated by Module 03 to run vRhyme for binning virus genomes.
The module will generate a folder vRhyme_best_bins_fasta containing FASTA files for the best vBins, and Merged_vRhyme_Outputs_Unfiltered_best_vBins_Memberships_geNomad_CheckV_Summary_Table.tsv, which summarizes the vRhyme output tables, in 07A_vRHYME_OUTPUT.
Then, the module will run CheckV and do the read mapping on vBin sequences and return results in 07B_vBINS_CHECKV and 07C_vBINS_READ_MAPPING, respectively. This will generate two output files: Unfiltered_best_vBins_read_mapping_information_Table.tsv and Merged_vRhyme_Outputs_Unfiltered_best_vBins_Memberships_geNomad_CheckV_Summary_read_mapping_information_Table.tsv.
The module will then generate three filtered coverage tables, following the same pattern as Module 05 (i.e., horizontal coverage), in 07D_vBINS_vOTUS_TABLES.
Next, the module will generate two FASTA files in 07E_FASTA_IPHOP_INPUTS, Filtered_filtration_Prokaryote_Unknown_best_vBins_Representative_Unbinned_vOTUs_Sequences_iPHoP_Input.fasta.tsv and Filtered_Prokaryote_Only_best_vBins_Representative_Unbinned_vOTUs_Sequences_iPHoP_Input.fasta, that combine vBin and unbinned sequences predicted to be phages and unknown, or only phages, respectively.
Finally, the module will create an MVP_07_Summary_Report.txt with an overview of the functional annotation.
Notes:
(1) The coverage tables will combine filtered vBins and unbinned contigs.
(2) All filtration thresholds (CheckV and horizontal coverage) are based on previous modules in order to get consistent outputs.
Executing Module 99 (Prepare NCBI MIUViG submission)
MVP_99_prep_MIUViG_submission -h
MVP_99_prep_MIUViG_submission -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- -g: identifier of the sequence to be processed.
- -s: should be one of "setup_metadata" (to be run first) or "prep_submission" (once sequence metadata have been checked and completed).
- -t: path to the BioSample submission template file, generated from https://submit.ncbi.nlm.nih.gov/genbank/template/submission/, only required for step 2 (prep_submission).
Module and output explanations
Executing Module 100 (Summarize outputs)
MVP_100_summarize_outputs -h
MVP_100_summarize_outputs -i WORKING_DIRECTORY/ -m data/example_metadata.txt
Flag explanations
- -i: (required) path to your working directory where all the MVP outputs will be generated and stored.
- -m: (required) your metadata file.
- --force: force creation of summary reports and figures even if they already exist.
Module and output explanations
This module will first find the highest module used by the user, calculate the total MVP running time, and create a final summary report containing all the MVP commands used. Then, the module will find and store the main output tables in 100_SUMMARIZED_OUTPUTS. Finally, the module will execute the corresponding R code based on the highest MVP module run by the user. This script essentially automates the process of summarizing outputs from MVP, including generating a report, copying final output files, and integrating R-based data analysis and visualization.
Notes:
(1) This module creates a folder within 100_SUMMARIZED_OUTPUTS named with the current date and time. This setup enables users to run the same modules with different arguments or options, or to add new modules. They can then execute Module 100, and this won't delete previous summary folders.
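The date-and-time folder described in note (1) can be sketched as follows. The function name make_summary_dir and the timestamp format are assumptions chosen for illustration; MVP's actual folder naming may differ.

```python
from datetime import datetime
from pathlib import Path


def make_summary_dir(working_dir, now=None):
    """Create and return a new time-stamped folder in 100_SUMMARIZED_OUTPUTS (hypothetical helper)."""
    now = now or datetime.now()
    # Assumed format, e.g. 2024-01-02_03-04-05; one folder per run.
    target = Path(working_dir) / "100_SUMMARIZED_OUTPUTS" / now.strftime("%Y-%m-%d_%H-%M-%S")
    # exist_ok=False means an existing summary folder is never silently reused.
    target.mkdir(parents=True, exist_ok=False)
    return target
```

Because every run writes to a folder with a new name, earlier summaries are left untouched when Module 100 is executed again.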
LINKS TO GITHUB AND PEER-REVIEWED PAPERS
Link to geNomad github
Camargo, A.P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., Chain, P. S. G., Nayfach, S., & Kyrpides, N. C. You can move, but you can’t hide: identification of mobile genetic elements with geNomad. bioRxiv (2023), DOI: 10.1101/2020.11.01.361691
Link to CheckV github
Nayfach, S., Camargo, A.P, Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021). https://doi.org/10.1038/s41587-020-00774-7
Link to Bowtie2 github
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359
Link to minimap2
Li H., New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021, 37(23):4572–4574.
Link to Samtools github
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map Format and SAMtools. Bioinformatics 25, no. 16 2009: 2078–79.
Link to CoverM github
Link to MMseqs2 github
Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, doi: 10.1093/bioinformatics/btab184 (2021).
Link to HMMER github
Finn, R.D., Clements, J., & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic acids research, 39(suppl_2), W29-W37 (2011).
COPYRIGHT NOTICE
Multi-choice Viromics Pipeline (MVP) Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Intellectual Property Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.
LICENCE AGREEMENT
GPL v3 License
Multi-choice Viromics Pipeline (MVP) Copyright (c) 2023, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.