An RNA/DNA virus strain-level identification tool for short reads.
Project description
VirStrain 
VirStrain is an RNA virus strain-level identification tool for short-read sequencing data.
Overview
VirStrain supports:
- Strain identification from single-end and paired-end short reads
- Strain identification from assembled contigs
- Construction of custom VirStrain databases
- Use of pre-built public databases for common viral species
Contact
- Email: heruiliao2-c@my.cityu.edu.hk
- Recommended version: v1.18
- Legacy note: v1.14 fixed some bugs, but did not include
virstrain_contigorvirstrain_merge
Changelog
2024 updates
2024-05-28
- v1.17: Added the
-vparameter to display version information
Available in the GitHub version only
2024-03-11
- v1.17: Synced all changes to both GitHub and Conda
2024-02-27
Tem_Vsfiles are now named randomly in the GitHub version- Added links for downloading pre-built databases
2023 updates
2023-10-12
- v1.14: Fixed a bug in v1.13 related to handling gzipped FASTQ files
2023-09-05
- Added a new function for contig-based viral strain identification
- Supports comprehensive identification across 45,619 strains from 28 viral species
2022 updates
2022-12-20
- v1.13: Fixed a database generation bug present in v1.12 of the Bioconda release
2022-12-16
- The VirStrain web server extension, StrainDetect, is now online:
https://strain.ee.cityu.edu.hk
2022-11-10
- Added parameter
-sto sort the most likely strain by site matches
2022-03-23
- Fixed a Perl script bug related to header name handling
2022-02-08
- Added an alternative method for downloading databases from Figshare
2022-02-05
- v1.12: VirStrain can now accept gzipped FASTQ input files
2021 updates
2021-11
- Added downloadable databases for two DNA viruses used in the paper:
- HBV
- HCMV
- Added a larger SARS-CoV-2 database
See Supplementary Section 1.1 of the paper
Requirements
Dependencies
- Python >= 3.10
- Recommended: 3.10.19
- Should work on python >3.11 as well
- Perl
- Python packages:
networkx==3.3numpy==1.26.4pandas==2.3.3biopython==1.84plotly==6.5.0
- Bowtie2
Required for VirStrain version >= v1.18
If you use Conda, you can install required packages automatically with:
sh install_package.sh
If you install VirStrain via Bioconda or pip, you can ignore manual dependency installation.
Installation
Supported platform: Linux / Ubuntu
Option 1: Install with Bioconda
Once Bioconda is configured:
conda install -c bioconda virstrain
chmod 755 bin/jellyfish-linux
Option 2: Install with pip
pip install virstrain==1.18
chmod 755 bin/jellyfish-linux
Option 3: Manual installation
Make sure all dependencies are installed first.
git clone https://github.com/liaoherui/VirStrain.git
cd VirStrain
chmod 755 bin/jellyfish-linux
rm VirStrain_DB.tar.gz
Command mapping
If you installed VirStrain via Bioconda or pip, use the following command names:
| Source install command | Bioconda / pip command |
|---|---|
python VirStrain.py -h |
virstrain -h |
python VirStrain_build.py -h |
virstrain_build -h |
python VirStrain_contig.py -h |
virstrain_contig -h |
python VirStrain_contigDB_merge.py -h |
virstrain_merge -h |
Databases
Download the default reference database
After cloning the repository:
cd VirStrain
sh download.sh
Alternative download method
cd VirStrain
wget -qO- "https://figshare.com/ndownloader/files/34002479" | tar -zx
You may also download the database manually from Google Drive or Figshare and extract it with:
tar -zxvf <downloaded_file>
If all download methods fail, please contact the author by email.
Additional downloadable databases
DNA virus databases
sh download_dna.sh
Includes databases for:
- HBV
- HCMV
Larger SARS-CoV-2 database
sh download_scov2_big.sh
Contig-level database
sh download_contig_db.sh
Pre-built database downloads
If the download scripts fail, pre-built databases are also available via Google Drive.
| Name | Description | Download |
|---|---|---|
VirStrain_DB.tar.gz |
Databases containing SCOV2, H1N1, and HIV strains used in the paper | Google Drive |
SCOV2_newBig.tar.gz |
Expanded database containing additional SCOV2 strains | Google Drive |
VirStrain_DNA_DB.tar.gz |
Databases containing HBV and HCMV strains | Google Drive |
VirStrain_contig_DB.tar.gz |
Contig-level database | Google Drive |
Usage
If you installed VirStrain via Bioconda or pip, replace script-based commands with the corresponding installed commands shown above.
1) Identify RNA virus strains from short reads
Single-end reads
python VirStrain.py -i Test_Data/MT451123_1.fq -d VirStrain_DB/SCOV2 -o MT451123_SE_Test
Paired-end reads
python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test
High-mutation viruses such as HIV
Use the -m option.
Single-end:
python VirStrain.py -i <Read1> -d VirStrain_DB/HIV -o <Output_dir> -m
Paired-end:
python VirStrain.py -i <Read1> -p <Read2> -d VirStrain_DB/HIV -o <Output_dir> -m
2) Identify viral strains from assembled contigs
python VirStrain_contig.py -i <Input_Contig_fasta> -d VirStrain_contig_DB -o VirStrain_Contig_Res
Convert read-based databases into a contig database
python VirStrain_contigDB_merge.py -i VirStrain_DB/SCOV2,VirStrain_DB/H1N1 -o VirStrain_contig_DB_merge
3) Build a custom VirStrain database
Basic usage:
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>
Important header naming rule
Characters , and | are not allowed in sequence headers in <Input_MSA>.
Examples:
- Not allowed:
>Strain_A, 2022 - Not allowed:
>Strain_A|2022 - Allowed:
>Strain_A_2022
Manual covering for small datasets or large viral genomes
For small strain collections (<1000 strains) or viruses with large genomes such as HCMV, you can use the manual covering function with -s to retain more useful sites.
Example:
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4
General guidance:
0.2–0.6is usually a reasonable range for-s- With very few strains (for example, 3 strains), a larger value such as
-s 0.8may also work
Restrict SNV site range
If you only want to use SNV sites from position x to y, use -r.
Example:
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4 -r 500-1000
Input format note
The input MSA must have the same format as an alignment generated by MAFFT:
https://mafft.cbrc.jp/alignment/software/
Full command-line options
VirStrain.py — short-read strain identification
Default k-mer size: 25
VirStrain - An RNA virus strain-level identification tool for short reads.
Example:
python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test
required arguments:
-i, --input_reads Input FASTQ data
-d, --database_dir Path to VirStrain database
optional arguments:
-h, --help Show help message and exit
-o, --output_dir Output directory (default: ./VirStrain_Out)
-p, --input_reads2 Input FASTQ data for paired-end reads
-c, --site_filter_cutoff Site filtering cutoff used when calculating Vscore (default: 0.05)
-s, --rank_by_sites If set to 1, sort the most likely strain by site matches (default: 0)
-f, --turn_off_figures If set to 1, do not generate figures (default: 0)
-m, --high_mutation_virus Use for high mutation rate viruses such as HIV
VirStrain_build.py — custom database construction
Default k-mer size: 25
VirStrain - An RNA virus strain-level identification tool for short reads.
Example:
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>
required arguments:
-i, --input_msa Input MSA file (must match MAFFT output format)
optional arguments:
-d, --database_dir Output directory for the constructed database (default: ./VirStrain_DB)
-c, --dash_cutoff Dash cutoff for each MSA column (default: 0)
-s, --sites_cutoff Cutoff for manual-covering function
(e.g. 1 = all useful sites; 0.8 = 80% of useful sites)
-r, --sites_rcutoff Site range cutoff for covering algorithm
(e.g. 3-500 means only SNV sites from positions 3 to 500 are considered)
Output format
VirStrain generates two primary outputs:
-
A text report
- Contains identified strains, depth, site coverage, and related metrics
-
An interactive HTML report
- Displays depth and site uniqueness information visually
You can find an example output in the MT451123_Sim_PE folder in this repository.
Example report image:
Report sections
| Header | Description |
|---|---|
| Most Possible strain* | The most likely strain detected by VirStrain. These are the strains with the highest Vscore in the first iteration. |
| Other Possible strains* | Additional possible strains detected by VirStrain. These are identified in later iterations. Based on the authors’ experiments, 10 mutations can be strong evidence for additional possible strains. |
Highest Map Strains |
The strain with the maximum Covered SNV site / Total SNV site in the first iteration. Provided for reference. |
Top 10 Score Strains |
The top 10 strains ranked by Vscore in the first iteration. This can help identify low-abundance strains highly similar to high-abundance strains. |
Headers marked with
*contain the main identification results.
Report columns
| Column | Description |
|---|---|
Strain_ID |
NCBI accession number or other public database identifier for the identified strain |
Cls_info |
Cluster information for the identified strain, e.g. Cluster2830_2 means cluster Cluster2830 with size 2 |
SubCls_info |
Sub-cluster information |
Vscore |
Score generated by the VirStrain algorithm |
Total_Map_Rate |
Covered sites out of total sites in the first iteration |
Valid_Map_Rate |
Covered sites out of total sites in the remaining iterations |
Strain_depth |
Predicted sequencing depth for the identified strain |
Strain_info |
Metadata for the identified strain, such as region and subtype |
SNV_freq |
SNV frequency across all sites |
Citation
If you use VirStrain, please cite:
Liao, H., Cai, D. & Sun, Y. VirStrain: a strain identification tool for RNA viruses. Genome Biology 23, 38 (2022). https://doi.org/10.1186/s13059-022-02609-x
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file virstrain-1.18.tar.gz.
File metadata
- Download URL: virstrain-1.18.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0197ed06bb04d7bc4e097579be99324dac889a233cee797e7576cd9270478d46
|
|
| MD5 |
cb560dd01888e4f3ca4650c2de00c2ec
|
|
| BLAKE2b-256 |
145c096218fc3dbab1f94510ed344cd9fd332c6ab8b9809ff7a50617772fbae1
|
File details
Details for the file virstrain-1.18-py3-none-any.whl.
File metadata
- Download URL: virstrain-1.18-py3-none-any.whl
- Upload date:
- Size: 1.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
19a9dd1d7f0d78c526f672b1263131623a55eadc3c91c70a1c01055d17e59e66
|
|
| MD5 |
0ee1553889db9c6d2f6237c4d78838f1
|
|
| BLAKE2b-256 |
8166507c3feed6c82e07649a9e793772b4d71c287ab115d4516d2d9db271e351
|