FluViewer
A tool for generating influenza A virus genome sequences from FASTQ data
Installation
- FluViewer requires the following dependencies. Installing them in a dedicated FluViewer virtual environment is recommended (the versions listed were tested; later versions can likely be substituted):
- python v3.8.5
- pandas v1.3.5
- spades v3.15.3
- blast v2.12.0
- bwa v0.7.17
- samtools v1.14
- bcftools v1.14
- bedtools v2.30.0
- seqtk v1.3
- Once the dependencies have been installed, install the latest FluViewer release via PyPI:
pip3 install FluViewer
- Download and unzip the default FluViewer DB (FluViewer_db.fa.gz) from this repository. Custom DBs can be created and used as well (instructions below).
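If a command-line `gunzip` is not available, the downloaded DB can also be decompressed from Python. The following is a minimal sketch using only the standard library; the helper name `decompress_db` is hypothetical and is not part of FluViewer.

```python
import gzip
import shutil
from pathlib import Path


def decompress_db(gz_path, out_path=None):
    """Decompress a gzipped FASTA file and return the path of the result.

    Hypothetical helper for illustration; FluViewer does not ship this function.
    """
    gz_path = Path(gz_path)
    # Default output name drops the trailing ".gz",
    # e.g. FluViewer_db.fa.gz becomes FluViewer_db.fa
    out_path = Path(out_path) if out_path else gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data instead of loading it all
    return out_path
```

Calling `decompress_db("FluViewer_db.fa.gz")` writes `FluViewer_db.fa` next to the archive; that path can then be passed to the `-d` argument.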
Usage
FluViewer -f <path_to_fwd_reads> -r <path_to_rev_reads> -d <path_to_db_file> -o <output_name> -m <mode> [-D <min_depth> -q <min_qual> -c <min_cov> -i <min_id>] [-g]
Required arguments:
-f : path to FASTQ file containing forward reads
-r : path to FASTQ file containing reverse reads
-d : path to FASTA file containing FluViewer database (details below)
-o : output name (creates directory with this name for output, includes this name in output files, and in consensus sequence headers)
-m : FluViewer run mode (align or assemble)
Optional arguments:
-D : Minimum read depth for base calling (default = 20)
-q : Minimum PHRED score for base quality and mapping quality (default = 30)
-c : Minimum coverage of database reference sequence by contig (percentage, default = 25)
-i : Minimum nucleotide sequence identity between database reference sequence and contig (percentage, default = 95)
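To illustrate how the two percentage thresholds (-c and -i) interact, here is a rough sketch of a filter over a single contig-to-reference alignment. How FluViewer computes coverage and identity internally is not described here, so the formulas, the function name `passes_filters`, and its parameters are assumptions made for illustration only.

```python
def passes_filters(ref_length, aligned_length, matches, min_cov=25.0, min_id=95.0):
    """Return True if an alignment meets both percentage thresholds.

    Assumed definitions (not taken from FluViewer's source):
      coverage = aligned reference positions / reference length
      identity = matching positions / aligned positions
    """
    if ref_length <= 0 or aligned_length <= 0:
        return False
    coverage = 100.0 * aligned_length / ref_length
    identity = 100.0 * matches / aligned_length
    return coverage >= min_cov and identity >= min_id
```

Under these assumptions, a contig covering 600 of 2000 reference positions with 588 matches (30% coverage, 98% identity) passes the defaults, while one covering only 400 positions (20% coverage) does not.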
Optional flags:
-g : Set this flag to deactivate garbage collection and retain intermediate files
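When running FluViewer on many samples from a script, it can help to assemble the argument list programmatically. The sketch below only builds the command from the options documented above; the function name `build_fluviewer_command` and its keyword names are hypothetical.

```python
def build_fluviewer_command(fwd, rev, db, output_name, mode,
                            min_depth=None, min_qual=None,
                            min_cov=None, min_id=None,
                            keep_intermediate=False):
    """Build a FluViewer argument list from the documented options.

    Hypothetical wrapper for illustration; optional values left as None
    are omitted so that FluViewer applies its own defaults.
    """
    if mode not in ("align", "assemble"):
        raise ValueError("mode must be 'align' or 'assemble'")
    cmd = ["FluViewer", "-f", str(fwd), "-r", str(rev),
           "-d", str(db), "-o", str(output_name), "-m", mode]
    optional = (("-D", min_depth), ("-q", min_qual),
                ("-c", min_cov), ("-i", min_id))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if keep_intermediate:
        cmd.append("-g")  # retain intermediate files
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.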
FluViewer Database
FluViewer requires a curated FASTA file "database" of influenza A virus reference sequences. Headers for these sequences must be formatted and annotated as follows:
>unique_id|strain_name|segment|subtype
For example:
>MF599463|A/swine/Kansas/A01378028/2017|HA|H3
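When building a custom DB, it may be useful to check headers before running FluViewer. The following sketch parses a header into its four fields and rejects malformed lines. The names `DbHeader` and `parse_db_header` are hypothetical, and the segment check is an assumption based on the segment names listed in the report description; FluViewer itself may validate headers differently.

```python
from typing import NamedTuple

# Segment names as listed for the report's "segment" column
SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


class DbHeader(NamedTuple):
    unique_id: str
    strain_name: str
    segment: str
    subtype: str


def parse_db_header(line):
    """Parse '>unique_id|strain_name|segment|subtype' into a DbHeader."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}")
    header = DbHeader(*fields)
    if header.segment not in SEGMENTS:
        raise ValueError(f"unknown segment: {header.segment}")
    return header
```

For the example header above, `parse_db_header` returns a record whose `segment` is `HA` and whose `subtype` is `H3`.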
FluViewer Output
FluViewer generates three output files:
- A FASTA file containing consensus sequences for influenza A virus genome segments
- A sorted BAM file with reads mapped to either the chosen reference sequences (align mode) or the assembled contigs (assemble mode)
- A report TSV file describing segment, subtype, and sequencing metrics for each consensus sequence
Headers in the FASTA file have the following format:
>output_name_unique_sequence_number|segment|subject
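As an illustration of working with this header format, the sketch below groups consensus sequence names by segment. The function name `consensus_by_segment` is hypothetical, and the first header field is kept whole rather than split further, since the exact layout of the sequence number is not specified here.

```python
from collections import defaultdict


def consensus_by_segment(fasta_lines):
    """Group (name, subject) pairs from consensus FASTA headers by segment.

    Hypothetical helper; expects headers with exactly three
    '|'-separated fields and ignores sequence lines.
    """
    groups = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        name, segment, subject = line[1:].strip().split("|")
        groups[segment].append((name, subject))
    return dict(groups)
```

Passing an open file handle for the consensus FASTA works as well as a list of lines, because the function only iterates over its argument.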
The report TSV file contains the following columns:
consensus_seq : the name of the consensus sequence described by this row
segment : influenza A virus genome segment (PB2, PB1, PA, HA, NP, NA, M, NS)
subtype : HA or NA subtype ("none" for internal segments)
mapped reads : the number of sequencing reads mapped to this segment
seq_length : the length (in nucleotides) of the consensus sequence generated by FluViewer
sequenced_bases : the number of nucleotide positions in the consensus sequence with sufficient depth of coverage (set by -D argument) and a successful base call (e.g. A, T, G, or C)
segment_cov : the number of sequenced bases in the consensus sequence divided by the typical length of this genome segment (as a percentage). The typical segment length is determined by finding the median length of the segment/subject reference sequences whose contig alignments have the highest bitscore.
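The relationship between sequenced_bases and segment_cov can be sketched as follows. This is an illustration of the definitions above rather than FluViewer's actual code: it assumes that positions without a successful base call appear in the consensus as some character other than A, T, G, or C (for example N), and that the lengths of the best-scoring reference sequences are already known. The function names mirror the column names for readability only.

```python
from statistics import median


def sequenced_bases(consensus):
    """Count positions called as A, T, G, or C (case-insensitive)."""
    return sum(1 for base in consensus.upper() if base in "ATGC")


def segment_cov(consensus, best_ref_lengths):
    """Percentage of the typical segment length that was sequenced.

    best_ref_lengths: lengths of the reference sequences with the highest
    bitscore for this segment/subject (assumed to be supplied by the caller).
    """
    typical_length = median(best_ref_lengths)
    return 100.0 * sequenced_bases(consensus) / typical_length
```

For example, a consensus of `ACGTNNAC` has 6 sequenced bases; with reference lengths of 10, 12, and 12 the median is 12, giving a segment_cov of 50.0.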