An RNA/DNA virus strain-level identification tool for short reads.

VirStrain

VirStrain is an RNA virus strain-level identification tool for short-read sequencing data.

Overview

VirStrain supports:

Strain identification from single-end and paired-end short reads
Strain identification from assembled contigs
Construction of custom VirStrain databases
Use of pre-built public databases for common viral species

Contact

Email: heruiliao2-c@my.cityu.edu.hk
Recommended version: v1.18
Legacy note: v1.14 fixed some bugs, but did not include virstrain_contig or virstrain_merge

Changelog

2024 updates

2024-05-28

v1.17: Added the -v parameter to display version information
Available in the GitHub version only

2024-03-11

v1.17: Synced all changes to both GitHub and Conda

2024-02-27

Tem_Vs files are now named randomly in the GitHub version
Added links for downloading pre-built databases

2023 updates

2023-10-12

v1.14: Fixed a bug in v1.13 related to handling gzipped FASTQ files

2023-09-05

Added a new function for contig-based viral strain identification
Supports comprehensive identification across 45,619 strains from 28 viral species

2022 updates

2022-12-20

v1.13: Fixed a database generation bug present in v1.12 of the Bioconda release

2022-12-16

The VirStrain web server extension, StrainDetect, is now online:
https://strain.ee.cityu.edu.hk

2022-11-10

Added parameter -s to sort the most likely strain by site matches

2022-03-23

Fixed a Perl script bug related to header name handling

2022-02-08

Added an alternative method for downloading databases from Figshare

2022-02-05

v1.12: VirStrain can now accept gzipped FASTQ input files

2021 updates

2021-11

Added downloadable databases for two DNA viruses used in the paper:
- HBV
- HCMV
Added a larger SARS-CoV-2 database
See Supplementary Section 1.1 of the paper

Requirements

Dependencies

Python >= 3.10
- Recommended: 3.10.19
- Newer Python versions (3.11 and later) should also work
Perl
Python packages:
- networkx==3.3
- numpy==1.26.4
- pandas==2.3.3
- biopython==1.84
- plotly==6.5.0
Bowtie2 (required for VirStrain v1.18 and later)

If you use Conda, you can install required packages automatically with:

sh install_package.sh

If you install VirStrain via Bioconda or pip, the dependencies are installed for you and you can skip the manual steps above.
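
For a manual installation, a quick pre-flight check can catch a missing dependency before a run fails. The following is a minimal sketch, not part of VirStrain: the function name `check_environment` is hypothetical, and the executable names `perl` and `bowtie2` are assumed to match what is on your PATH.

```python
import shutil
import sys

# Minimum Python version taken from the dependency list above.
MIN_PYTHON = (3, 10)
# External programs from the dependency list; the executable names are assumptions.
EXTERNAL_TOOLS = ("perl", "bowtie2")


def check_environment(version_info=None, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed."""
    if version_info is None:
        version_info = sys.version_info
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            "Python %d.%d found, but >= %d.%d is required"
            % (version_info[0], version_info[1], MIN_PYTHON[0], MIN_PYTHON[1])
        )
    for tool in EXTERNAL_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems
```

Calling `check_environment()` with no arguments inspects the running interpreter and the current PATH. The `version_info` and `which` parameters exist only so the check can be exercised without changing the system.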

Installation

Supported platform: Linux / Ubuntu

Option 1: Install with Bioconda

Once Bioconda is configured:

conda install -c bioconda virstrain

Option 2: Install with pip

pip install virstrain==1.18

Option 3: Manual installation

Make sure all dependencies are installed first.

git clone https://github.com/liaoherui/VirStrain.git
cd VirStrain
chmod 755 bin/jellyfish-linux
rm VirStrain_DB.tar.gz

Command mapping

If you installed VirStrain via Bioconda or pip, use the following command names:

Source install command	Bioconda / pip command
`python VirStrain.py -h`	`virstrain -h`
`python VirStrain_build.py -h`	`virstrain_build -h`
`python VirStrain_contig.py -h`	`virstrain_contig -h`
`python VirStrain_contigDB_merge.py -h`	`virstrain_merge -h`
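
If you script VirStrain from Python and want the same code to work with both install styles, the mapping above can be captured in a small lookup. This is only an illustrative sketch: `command_prefix` is a hypothetical helper, and it assumes the source scripts are in the current working directory when the installed command is not found.

```python
import shutil

# Mapping copied from the command table above: source script -> installed command.
SCRIPT_TO_COMMAND = {
    "VirStrain.py": "virstrain",
    "VirStrain_build.py": "virstrain_build",
    "VirStrain_contig.py": "virstrain_contig",
    "VirStrain_contigDB_merge.py": "virstrain_merge",
}


def command_prefix(script, which=shutil.which):
    """Return the argv prefix used to launch a VirStrain script.

    The installed command is preferred when it is on PATH; otherwise the
    script is run from a source checkout with `python`.
    """
    installed = SCRIPT_TO_COMMAND[script]
    if which(installed) is not None:
        return [installed]
    return ["python", script]
```

For example, `command_prefix("VirStrain.py") + ["-h"]` gives an argument list that can be passed to `subprocess.run`.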

Databases

Download the default reference database

After cloning the repository:

cd VirStrain
sh download.sh

Alternative download method

cd VirStrain
wget -qO- "https://figshare.com/ndownloader/files/34002479" | tar -zx

You may also download the database manually from Google Drive or Figshare and extract it with:

tar -zxvf <downloaded_file>

If all download methods fail, please contact the author by email.

Additional downloadable databases

DNA virus databases

sh download_dna.sh

Includes databases for:

HBV
HCMV

Larger SARS-CoV-2 database

sh download_scov2_big.sh

Contig-level database

sh download_contig_db.sh

Pre-built database downloads

If the download scripts fail, pre-built databases are also available via Google Drive.

Name	Description	Download
`VirStrain_DB.tar.gz`	Databases containing SCOV2, H1N1, and HIV strains used in the paper	Google Drive
`SCOV2_newBig.tar.gz`	Expanded database containing additional SCOV2 strains	Google Drive
`VirStrain_DNA_DB.tar.gz`	Databases containing HBV and HCMV strains	Google Drive
`VirStrain_contig_DB.tar.gz`	Contig-level database	Google Drive

Usage

If you installed VirStrain via Bioconda or pip, replace script-based commands with the corresponding installed commands shown above.

1) Identify RNA virus strains from short reads

Single-end reads

python VirStrain.py -i Test_Data/MT451123_1.fq -d VirStrain_DB/SCOV2 -o MT451123_SE_Test

Paired-end reads

python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

High-mutation viruses such as HIV

Use the -m option.

Single-end:

python VirStrain.py -i <Read1> -d VirStrain_DB/HIV -o <Output_dir> -m

Paired-end:

python VirStrain.py -i <Read1> -p <Read2> -d VirStrain_DB/HIV -o <Output_dir> -m
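
When processing many samples, it can help to assemble these commands programmatically. The sketch below builds the argument list for the four cases above (single-end or paired-end, with or without -m). The function `build_identify_command` is hypothetical and uses only the flags shown in this section.

```python
def build_identify_command(reads1, database_dir, output_dir=None,
                           reads2=None, high_mutation=False,
                           prefix=("python", "VirStrain.py")):
    """Assemble the argv list for a read-based VirStrain run.

    Only flags that appear in the usage examples are used:
    -i, -p, -d, -o and -m.
    """
    command = list(prefix) + ["-i", str(reads1)]
    if reads2 is not None:
        command += ["-p", str(reads2)]  # second mate file for paired-end data
    command += ["-d", str(database_dir)]
    if output_dir is not None:
        command += ["-o", str(output_dir)]
    if high_mutation:
        command.append("-m")  # high-mutation mode, e.g. for HIV
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. For a Bioconda or pip installation, pass `prefix=("virstrain",)`.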

2) Identify viral strains from assembled contigs

python VirStrain_contig.py -i <Input_Contig_fasta> -d VirStrain_contig_DB -o VirStrain_Contig_Res

Convert read-based databases into a contig database

python VirStrain_contigDB_merge.py -i VirStrain_DB/SCOV2,VirStrain_DB/H1N1 -o VirStrain_contig_DB_merge
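
The -i argument in the merge command is a comma-separated list of database directories. The sketch below, with the hypothetical helper `build_merge_command`, joins the directories and rejects a path that itself contains a comma. That check is an assumption based on the comma being used as the separator.

```python
def build_merge_command(database_dirs, output_dir,
                        prefix=("python", "VirStrain_contigDB_merge.py")):
    """Assemble the argv list that merges read-based databases into a contig database."""
    dirs = [str(d) for d in database_dirs]
    if not dirs:
        raise ValueError("at least one database directory is required")
    for directory in dirs:
        # A comma inside a path would be indistinguishable from the separator.
        if "," in directory:
            raise ValueError("database path must not contain a comma: %r" % directory)
    return list(prefix) + ["-i", ",".join(dirs), "-o", str(output_dir)]
```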

3) Build a custom VirStrain database

Basic usage:

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>

Important header naming rule

Characters , and | are not allowed in sequence headers in <Input_MSA>.

Examples:

Not allowed: >Strain_A, 2022
Not allowed: >Strain_A|2022
Allowed: >Strain_A_2022
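
Before building a database, you may want to scan the MSA for headers that break this rule. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and replacing forbidden characters with an underscore is an arbitrary choice for illustration, not something VirStrain does for you.

```python
FORBIDDEN = (",", "|")


def find_bad_headers(lines):
    """Return (line_number, header) pairs for FASTA headers containing , or |."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">") and any(ch in line for ch in FORBIDDEN):
            bad.append((number, line.rstrip("\n")))
    return bad


def sanitize_header(header, replacement="_"):
    """Replace each forbidden character in a header with `replacement`."""
    for ch in FORBIDDEN:
        header = header.replace(ch, replacement)
    return header
```

To check a file, pass the open file object: `find_bad_headers(open("<Input_MSA>"))`. Note that renaming can produce duplicate headers (for example, `>Strain_A|2022` becomes `>Strain_A_2022`), so check for collisions after sanitizing.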

Manual covering for small datasets or large viral genomes

For small strain collections (<1000 strains) or viruses with large genomes such as HCMV, you can use the manual covering function with -s to retain more useful sites.

Example:

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4

General guidance:

0.2–0.6 is usually a reasonable range for -s
With very few strains (for example, 3 strains), a larger value such as -s 0.8 may also work

Restrict SNV site range

If you only want to use SNV sites from position x to y, use -r.

Example:

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4 -r 500-1000
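
The -s and -r values are easy to mistype, so a wrapper can validate them before launching a build. This is a sketch under stated assumptions: `parse_site_range` and `build_db_command` are hypothetical helpers, the requirement that -s lies in (0, 1] is inferred from its description as a fraction of useful sites, and the requirement that x does not exceed y in the range is assumed.

```python
def parse_site_range(text):
    """Parse a range such as '500-1000' into (start, end)."""
    start_text, sep, end_text = text.partition("-")
    if not sep or not start_text.isdigit() or not end_text.isdigit():
        raise ValueError("expected a range like 500-1000, got %r" % text)
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError("range start %d is greater than end %d" % (start, end))
    return start, end


def build_db_command(msa, database_dir, sites_cutoff=None, site_range=None,
                     prefix=("python", "VirStrain_build.py")):
    """Assemble the argv list for building a custom database."""
    command = list(prefix) + ["-i", str(msa), "-d", str(database_dir)]
    if sites_cutoff is not None:
        # Assumed bound: -s is a fraction of useful sites, where 1 means all of them.
        if not 0 < sites_cutoff <= 1:
            raise ValueError("-s should be greater than 0 and at most 1")
        command += ["-s", str(sites_cutoff)]
    if site_range is not None:
        start, end = parse_site_range(site_range)
        command += ["-r", "%d-%d" % (start, end)]
    return command
```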

Input format note

The input MSA must have the same format as an alignment generated by MAFFT:

https://mafft.cbrc.jp/alignment/software/
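
MAFFT writes aligned FASTA by default, in which every sequence has the same length once gaps are counted. A basic sanity check for that property can be sketched as follows. The helper names are hypothetical, and this tests only one necessary condition (equal lengths); it does not establish that the file is a valid VirStrain input.

```python
def read_fasta_lengths(lines):
    """Return {header: sequence_length} for FASTA text given as an iterable of lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            lengths[current] = 0
        elif current is not None:
            # Sequences may be wrapped over several lines; gaps (-) count as columns.
            lengths[current] += len(line)
    return lengths


def is_aligned(lines):
    """True if there is at least one sequence and all sequences have equal length."""
    lengths = read_fasta_lengths(lines)
    return len(lengths) > 0 and len(set(lengths.values())) == 1
```

Duplicate headers overwrite each other in this sketch, so it is best used after the headers have been checked for uniqueness.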

Full command-line options

VirStrain.py — short-read strain identification

Default k-mer size: 25

VirStrain - An RNA virus strain-level identification tool for short reads.

Example:
python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

required arguments:
    -i, --input_reads             Input FASTQ data
    -d, --database_dir            Path to VirStrain database

optional arguments:
    -h, --help                    Show help message and exit
    -o, --output_dir              Output directory (default: ./VirStrain_Out)
    -p, --input_reads2            Input FASTQ data for paired-end reads
    -c, --site_filter_cutoff      Site filtering cutoff used when calculating Vscore (default: 0.05)
    -s, --rank_by_sites           If set to 1, sort the most likely strain by site matches (default: 0)
    -f, --turn_off_figures        If set to 1, do not generate figures (default: 0)
    -m, --high_mutation_virus     Use for high mutation rate viruses such as HIV

VirStrain_build.py — custom database construction

Default k-mer size: 25

VirStrain - An RNA virus strain-level identification tool for short reads.

Example:
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>

required arguments:
    -i, --input_msa               Input MSA file (must match MAFFT output format)

optional arguments:
    -d, --database_dir            Output directory for the constructed database (default: ./VirStrain_DB)
    -c, --dash_cutoff             Dash cutoff for each MSA column (default: 0)
    -s, --sites_cutoff            Cutoff for manual-covering function
                                  (e.g. 1 = all useful sites; 0.8 = 80% of useful sites)
    -r, --sites_rcutoff           Site range cutoff for covering algorithm
                                  (e.g. 3-500 means only SNV sites from positions 3 to 500 are considered)

Output format

VirStrain generates two primary outputs:

A text report
- Contains identified strains, depth, site coverage, and related metrics
An interactive HTML report
- Displays depth and site uniqueness information visually

You can find an example output in the MT451123_Sim_PE folder in this repository.

[Figure: example VirStrain report]

Report sections

Header	Description
Most Possible strain*	The most likely strain detected by VirStrain: the strain with the highest Vscore in the first iteration.
Other Possible strains*	Additional possible strains detected by VirStrain. These are identified in later iterations. Based on the authors’ experiments, 10 mutations can be strong evidence for additional possible strains.
`Highest Map Strains`	The strain with the maximum `Covered SNV site / Total SNV site` in the first iteration. Provided for reference.
`Top 10 Score Strains`	The top 10 strains ranked by Vscore in the first iteration. This can help identify low-abundance strains highly similar to high-abundance strains.

Headers marked with * contain the main identification results.

Report columns

Column	Description
`Strain_ID`	NCBI accession number or other public database identifier for the identified strain
`Cls_info`	Cluster information for the identified strain, e.g. `Cluster2830_2` means cluster `Cluster2830` with size `2`
`SubCls_info`	Sub-cluster information
`Vscore`	Score generated by the VirStrain algorithm
`Total_Map_Rate`	Covered sites out of total sites in the first iteration
`Valid_Map_Rate`	Covered sites out of total sites in the remaining iterations
`Strain_depth`	Predicted sequencing depth for the identified strain
`Strain_info`	Metadata for the identified strain, such as region and subtype
`SNV_freq`	SNV frequency across all sites
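
When post-processing reports, the `Cls_info` value can be split into its cluster name and size. The sketch below follows the `Cluster2830_2` example from the table. The function `parse_cls_info` is hypothetical, and it assumes the size is always the part after the last underscore.

```python
def parse_cls_info(value):
    """Split a Cls_info value such as 'Cluster2830_2' into (cluster_name, size)."""
    name, sep, size = value.rpartition("_")
    if not sep or not name or not size.isdigit():
        raise ValueError("unexpected Cls_info value: %r" % value)
    return name, int(size)
```

For the table's example, `parse_cls_info("Cluster2830_2")` returns `("Cluster2830", 2)`.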

Citation

If you use VirStrain, please cite:

Liao, H., Cai, D. & Sun, Y. VirStrain: a strain identification tool for RNA viruses. Genome Biology 23, 38 (2022). https://doi.org/10.1186/s13059-022-02609-x

