Annotate Influenza A virus gene segment sequences and output GFF3 files.
gfflu
gfflu is a Python CLI app that annotates Influenza A virus (IAV) gene segment nucleotide sequences with BLASTX and Miniprot, using the same protein sequences as the Influenza Virus Sequence Annotation Tool, and outputs a GFF3 file with the expected genetic features for each of the 8 IAV gene segments.
Usage
Below is an example of typical usage with a FASTA nucleotide sequence file (Segment_4_HA.MH201222.fasta):
gfflu Segment_4_HA.MH201222.fasta
This produces an output directory (gfflu-outdir/ by default) with the following files:
$ tree gfflu-outdir/
gfflu-outdir/
├── Segment_4_HA.MH201222.blastx.tsv
├── Segment_4_HA.MH201222.faa
├── Segment_4_HA.MH201222.gbk
├── Segment_4_HA.MH201222.gff
└── Segment_4_HA.MH201222.miniprot.gff
1 directory, 5 files
Specify the output directory with -o /path/to/outdir
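For scripted runs, the call above can be wrapped in Python. The following is a minimal sketch, not part of gfflu itself: the helper names (build_gfflu_command, expected_outputs, run_gfflu) are hypothetical, the flags are taken from the help output below, and the output file suffixes are assumed from the example listing above, so they may differ in other versions.

```python
from pathlib import Path
import subprocess

# Suffixes copied from the example `tree` listing; treat them as an assumption.
EXPECTED_SUFFIXES = (".blastx.tsv", ".faa", ".gbk", ".gff", ".miniprot.gff")


def build_gfflu_command(fasta, outdir="gfflu-outdir", force=False, prefix=None):
    """Assemble a gfflu argument list using only flags shown in the help output."""
    cmd = ["gfflu", "--outdir", str(outdir)]
    if force:
        cmd.append("--force")
    if prefix is not None:
        cmd += ["--prefix", prefix]
    cmd.append(str(fasta))
    return cmd


def expected_outputs(fasta, outdir="gfflu-outdir"):
    """Guess output paths, assuming files are named after the FASTA file stem."""
    stem = Path(fasta).stem
    return [Path(outdir) / f"{stem}{suffix}" for suffix in EXPECTED_SUFFIXES]


def run_gfflu(fasta, outdir="gfflu-outdir", force=False):
    """Run gfflu (it must be installed) and return the expected files that exist."""
    subprocess.run(build_gfflu_command(fasta, outdir, force), check=True)
    return [path for path in expected_outputs(fasta, outdir) if path.exists()]
```

Calling run_gfflu("Segment_4_HA.MH201222.fasta") should then return the paths of whichever expected files were written, which makes a missing output easy to spot in a pipeline.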
Help output:
Usage: gfflu [OPTIONS] FASTA
Annotate Influenza A virus sequences using Miniprot and BLASTX
The Miniprot GFF for a particular reference sequence gene segment will have multiple annotations for the same gene. This script will select the top scoring annotation for each gene and write out a new GFF file that can be used
with SnpEff.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * fasta FILE Influenza virus nucleotide sequence FASTA file [default: None] [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --outdir -o PATH Output directory [default: gfflu-outdir] │
│ --force -f Overwrite existing files │
│ --prefix -p TEXT Output file prefix [default: None] │
│ --verbose -v │
│ --version -V Print 'gfflu version 0.0.2' and exit │
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
gfflu version 0.0.2; Python 3.10.5
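The help text above says that the top scoring Miniprot annotation is selected for each gene. The sketch below illustrates that idea only; it is not gfflu's actual implementation. It assumes tab-separated GFF3 input, that alignment-level records have the feature type mRNA, and that the gene name is the part of the Target attribute before the first "|" (as in the Target values shown in the Annotation section). The function names are hypothetical.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, decoding %XX escapes."""
    pairs = (field.split("=", 1) for field in column9.split(";") if "=" in field)
    return {key: unquote(value) for key, value in pairs}


def top_scoring_per_gene(gff_lines, feature_type="mRNA"):
    """Keep the highest-scoring record of `feature_type` for each gene name."""
    best = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        try:
            score = float(cols[5])
        except ValueError:
            continue  # score column is "." or malformed
        target = parse_attributes(cols[8]).get("Target", "")
        gene = target.split("|", 1)[0]  # assumed naming scheme: GENE|TYPE|...
        if not gene:
            continue
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, cols)
    return {gene: cols for gene, (score, cols) in best.items()}
```

Each value in the returned dict is the list of nine columns of the winning record, so the caller can rewrite the feature type or attributes before writing a new GFF file.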
Installation
Conda
This is the recommended installation method.
conda install -c bioconda gfflu
PyPI
pip install gfflu
This install method assumes that you have BLAST+ and Miniprot installed and on your $PATH.
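A quick way to confirm that requirement before running gfflu is to look the executables up on $PATH. This is a minimal sketch; the executable names blastx and miniprot are assumptions about how BLAST+ and Miniprot are installed on your system, and missing_tools is a hypothetical helper.

```python
import shutil

# Assumed executable names: BLAST+ provides `blastx`; Miniprot provides `miniprot`.
REQUIRED_TOOLS = ("blastx", "miniprot")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list from missing_tools() means both programs were found; otherwise the list names what still needs to be installed.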
From Source
Using conda to manage the environment from the provided environment.yml file is recommended.
git clone https://github.com/CFIA-NCFAD/gfflu.git
cd gfflu
conda env create -f environment.yml
conda activate gfflu
Annotation
gfflu outputs a SnpEff-compatible GFF with the same features as those identified by the Influenza Virus Sequence Annotation Tool.
Segment 1
Influenza Virus Sequence Annotation Tool output
>Feature MH201221
16 2295 gene
gene PB2
16 2295 CDS
product polymerase PB2
protein_id MH201221p1
gene PB2
INFO: Length: 2316 nucleotides
INFO: Segment: 1 (PB2)
INFO: Sequence completeness: protein 1 - complete; nucleotide - complete
INFO: This sequence (MH201221) contains following signature mutation(s) that might confer high virulence of the virus: (E627K)
INFO: Virus type: influenza A
NCBI Genbank GFF for MH201221.1
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region MH201221.1 1 2316
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11320
MH201221.1 Genbank region 1 2316 . + . ID=MH201221.1:1..2316;Dbxref=taxon:11320;Name=1;gbkey=Src;isolation-source=embyonated chicken eggs;mol_type=viral cRNA;note=laboratory-derived;segment=1;serotype=H1N1;strain=A/PR/8_RGCDC-4%2C6/34
MH201221.1 Genbank gene 16 2295 . + . ID=gene-PB2;Name=PB2;gbkey=Gene;gene=PB2;gene_biotype=protein_coding
MH201221.1 Genbank CDS 16 2295 . + 0 ID=cds-AVY92608.1;Parent=gene-PB2;Dbxref=NCBI_GP:AVY92608.1;Name=AVY92608.1;gbkey=CDS;gene=PB2;product=polymerase PB2;protein_id=AVY92608.1
gfflu GFF
##gff-version 3
##sequence-region MH201221 1 2295
MH201221 miniprot gene 16 2295 3747 + . ID=gene-PB2;Identity=0.9631;Name=PB2;Positive=0.9842;Rank=1;Target=PB2%7CCDS%7Cpolymerase_PB2%7CSeg1prot1A 1 759;gene=PB2;gene_biotype=protein_coding
MH201221 miniprot CDS 16 2295 3747 . 0 ID=cds-PB2;Identity=0.9631;Parent=gene-PB2;Rank=1;Target=PB2%7CCDS%7Cpolymerase_PB2%7CSeg1prot1A 1 759;gene=PB2;product=polymerase PB2
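To check agreement between two annotations such as the NCBI and gfflu GFFs above, the gene and CDS coordinates can be compared directly. The sketch below assumes tab-separated GFF3 text (the listings on this page show the columns separated by spaces) and ignores sequence IDs, scores and attributes; feature_coords and compare_annotations are hypothetical helper names.

```python
def feature_coords(gff_text, types=("gene", "CDS")):
    """Collect (type, start, end) for the chosen feature types in GFF3 text."""
    coords = set()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) >= 5 and cols[2] in types:
            coords.add((cols[2], int(cols[3]), int(cols[4])))
    return coords


def compare_annotations(gff_a, gff_b):
    """Return (features only in A, features only in B), each sorted."""
    a, b = feature_coords(gff_a), feature_coords(gff_b)
    return sorted(a - b), sorted(b - a)
```

For Segment 1 both listings contain a gene and a CDS at 16..2295, so a comparison of this kind should report no differences.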
Segment 2
Influenza Virus Sequence Annotation Tool output
>Feature CY147460
13 2286 gene
gene PB1
13 2286 CDS
product polymerase PB1
protein_id CY147460p1
gene PB1
107 370 gene
gene PB1-F2
107 370 CDS
product PB1-F2 protein
protein_id CY147460p2
gene PB1-F2
INFO: Length: 2316 nucleotides
INFO: Segment: 2 (PB1)
INFO: Sequence completeness: protein 1 - complete; protein 2 - complete; nucleotide - complete
INFO: Virus type: influenza A
NCBI Genbank GFF for CY147460.1
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region CY147460.1 1 2316
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1343803
CY147460.1 Genbank region 1 2316 . + . ID=CY147460.1:1..2316;Dbxref=taxon:1343803;Name=2;collection-date=1934;country=Puerto Rico;gbkey=Src;lab-host=? + egg2 passage(s);mol_type=viral cRNA;nat-host=human;note=Strain PR8-LVD2 is phenotypically distinct from PR8 molecular clone;segment=2;serotype=H1N1;strain=A/Puerto Rico/8-LVD2/1934
CY147460.1 Genbank sequence_feature 1 2316 . + . ID=id-CY147460.1:1..2316;Dbxref=IRD:NIGSP_JY2_00027.PB1;gbkey=misc_feature
CY147460.1 Genbank gene 13 2286 . + . ID=gene-PB1;Name=PB1;gbkey=Gene;gene=PB1;gene_biotype=protein_coding
CY147460.1 Genbank CDS 13 2286 . + 0 ID=cds-AGQ47939.1;Parent=gene-PB1;Dbxref=NCBI_GP:AGQ47939.1;Name=AGQ47939.1;gbkey=CDS;gene=PB1;product=polymerase PB1;protein_id=AGQ47939.1
CY147460.1 Genbank gene 107 370 . + . ID=gene-PB1-F2;Name=PB1-F2;gbkey=Gene;gene=PB1-F2;gene_biotype=protein_coding
CY147460.1 Genbank CDS 107 370 . + 0 ID=cds-AGQ47940.1;Parent=gene-PB1-F2;Dbxref=NCBI_GP:AGQ47940.1;Name=AGQ47940.1;gbkey=CDS;gene=PB1-F2;product=PB1-F2 protein;protein_id=AGQ47940.1
gfflu GFF
##gff-version 3
##sequence-region CY147460 1 2286
CY147460 miniprot gene 13 2286 3892 + . ID=gene-PB1;Identity=0.9762;Name=PB1;Positive=0.9974;Rank=1;Target=PB1%7CCDS%7Cpolymerase_PB1%7Cseg2prot1B 1 757;gene=PB1;gene_biotype=protein_coding
CY147460 miniprot CDS 13 2286 3892 . 0 ID=cds-PB1;Identity=0.9762;Parent=gene-PB1;Rank=1;Target=PB1%7CCDS%7Cpolymerase_PB1%7Cseg2prot1B 1 757;gene=PB1;product=polymerase PB1
CY147460 feature gene 107 370 . + . ID=gene-PB1-F2;Target=PB1-F2%7CCDS%7CPB1-F2_protein%7Cseg2prot2M;gene=PB1-F2;gene_biotype=protein_coding
CY147460 feature CDS 107 370 . + 0 ID=cds-PB1-F2;Parent=gene-PB1-F2;Target=PB1-F2%7CCDS%7CPB1-F2_protein%7Cseg2prot2M;gene=PB1-F2;product=PB1-F2 protein
Segment 3
Influenza Virus Sequence Annotation Tool output
>Feature CY146806
13 2163 gene
gene PA
13 2163 CDS
product polymerase PA
protein_id CY146806p1
gene PA
13 772 gene
gene PA-X
13 582 CDS
584 772
product PA-X protein
protein_id CY146806p2
exception ribosomal slippage
gene PA-X
INFO: Length: 2208 nucleotides
INFO: Segment: 3 (PA)
INFO: Sequence completeness: protein 1 - complete; protein 2 - complete; nucleotide - complete
INFO: Virus type: influenza A
NCBI Genbank GFF for CY146806.1
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region CY146806.1 1 2208
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1346461
CY146806.1 Genbank region 1 2208 . + . ID=CY146806.1:1..2208;Dbxref=taxon:1346461;Name=3;country=USA: Texas;gbkey=Src;lab-host=? + egg2 passage(s);mol_type=viral cRNA;nat-host=human;segment=3;serotype=H3N2;strain=A/Texas/JY2/unknown
CY146806.1 Genbank sequence_feature 1 2208 . + . ID=id-CY146806.1:1..2208;Dbxref=IRD:NIGSP_JY2_00014.PA;gbkey=misc_feature
CY146806.1 Genbank gene 13 2163 . + . ID=gene-PA;Name=PA;gbkey=Gene;gene=PA;gene_biotype=protein_coding
CY146806.1 Genbank CDS 13 2163 . + 0 ID=cds-AGO00320.1;Parent=gene-PA;Dbxref=NCBI_GP:AGO00320.1;Name=AGO00320.1;gbkey=CDS;gene=PA;product=polymerase PA;protein_id=AGO00320.1
CY146806.1 Genbank gene 13 772 . + . ID=gene-PA-X;Name=PA-X;gbkey=Gene;gene=PA-X;gene_biotype=protein_coding
CY146806.1 Genbank CDS 13 582 . + 0 ID=cds-AGO00321.1;Parent=gene-PA-X;Dbxref=NCBI_GP:AGO00321.1;Name=AGO00321.1;exception=ribosomal slippage;gbkey=CDS;gene=PA-X;product=PA-X protein;protein_id=AGO00321.1
CY146806.1 Genbank CDS 584 772 . + 0 ID=cds-AGO00321.1;Parent=gene-PA-X;Dbxref=NCBI_GP:AGO00321.1;Name=AGO00321.1;exception=ribosomal slippage;gbkey=CDS;gene=PA-X;product=PA-X protein;protein_id=AGO00321.1
TODO: handle/add "exception=ribosomal slippage" to PA-X CDS
gfflu GFF
##gff-version 3
##sequence-region CY146806 1 2163
CY146806 miniprot gene 13 2163 3758 + . ID=gene-PA;Identity=0.9986;Name=PA;Positive=1.0000;Rank=1;Target=PA%7CCDS%7Cpolymerase_PA%7Cseg3prot 1 716;gene=PA;gene_biotype=protein_coding
CY146806 miniprot CDS 13 2163 3758 . 0 ID=cds-PA;Identity=0.9986;Parent=gene-PA;Rank=1;Target=PA%7CCDS%7Cpolymerase_PA%7Cseg3prot 1 716;gene=PA;product=polymerase PA
CY146806 miniprot gene 13 772 1301 + . Frameshift=1;ID=gene-PA-X;Identity=0.9987;Name=PA-X;Positive=0.9987;Rank=1;Target=PA-X%7CCDS%7CPA-X_protein%7Cseg3prot2C 1 252;gene=PA-X;gene_biotype=protein_coding
CY146806 miniprot CDS 13 772 1301 . 0 Frameshift=1;ID=cds-PA-X;Identity=0.9987;Parent=gene-PA-X;Rank=1;Target=PA-X%7CCDS%7CPA-X_protein%7Cseg3prot2C 1 252;gene=PA-X;product=PA-X protein
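The PA-X coordinates can be checked with a little arithmetic. The annotation tool and NCBI describe the CDS as two intervals, 13..582 and 584..772, while the gfflu CDS row spans 13..772 with Frameshift=1. The sketch below assumes 1-based inclusive coordinates and that the final codon is a stop codon; the function names are hypothetical.

```python
def cds_length(intervals):
    """Total nucleotides covered by 1-based, inclusive (start, end) intervals."""
    return sum(end - start + 1 for start, end in intervals)


def encoded_residues(intervals):
    """Amino acids encoded, assuming the last codon is a stop codon."""
    length = cds_length(intervals)
    if length % 3:
        raise ValueError(f"CDS length {length} is not a multiple of 3")
    return length // 3 - 1


# PA-X as two joined intervals; nucleotide 583 is skipped by the frameshift.
pa_x = [(13, 582), (584, 772)]
print(cds_length(pa_x))         # 570 + 189 = 759
print(encoded_residues(pa_x))   # 252, consistent with "Target=... 1 252"

# As one unbroken interval the span is 760 nt, not a multiple of 3;
# presumably this is what the Frameshift=1 attribute signals.
print(cds_length([(13, 772)]))  # 760
```

The same check applied to PA (13..2163) gives 2151 nucleotides and 716 residues, which agrees with its Target range of 1 to 716.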
Segment 4
Influenza Virus Sequence Annotation Tool output
>Feature MH201222.1
21 1721 gene
gene HA
21 1721 CDS
product hemagglutinin
protein_id MH201222.1p1
function receptor binding and fusion protein
gene HA
21 71 sig_peptide
72 1052 mat_peptide
product HA1
1053 1718 mat_peptide
product HA2
INFO: Length: 1753 nucleotides
INFO: Segment: 4 (HA)
INFO: Sequence completeness: protein 1 - complete; nucleotide - complete
INFO: Serotype: H1
INFO: Virus type: influenza A
NCBI Genbank GFF for MH201222.1
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region MH201222.1 1 1753
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=11320
MH201222.1 Genbank region 1 1753 . + . ID=MH201222.1:1..1753;Dbxref=taxon:11320;Name=4;gbkey=Src;isolation-source=embyonated chicken eggs;mol_type=viral cRNA;note=laboratory-derived;segment=4;serotype=H1N1;strain=A/PR/8_RGCDC-4%2C6/34
MH201222.1 Genbank gene 21 1721 . + . ID=gene-HA;Name=HA;gbkey=Gene;gene=HA;gene_biotype=protein_coding
MH201222.1 Genbank CDS 21 1721 . + 0 ID=cds-AVY92609.1;Parent=gene-HA;Dbxref=NCBI_GP:AVY92609.1;Name=AVY92609.1;gbkey=CDS;gene=HA;product=hemagglutinin;protein_id=AVY92609.1
MH201222.1 Genbank signal_peptide_region_of_CDS 21 71 . + . ID=id-AVY92609.1:1..17;Parent=cds-AVY92609.1;gbkey=Prot
MH201222.1 Genbank mature_protein_region_of_CDS 72 1052 . + . ID=id-AVY92609.1:18..344;Parent=cds-AVY92609.1;gbkey=Prot;product=HA1
MH201222.1 Genbank mature_protein_region_of_CDS 1053 1718 . + . ID=id-AVY92609.1:345..566;Parent=cds-AVY92609.1;gbkey=Prot;product=HA2
gfflu GFF
##gff-version 3
##sequence-region MH201222 1 1721
MH201222 miniprot gene 21 1721 2545 + . ID=gene-HA;Identity=0.8233;Name=HA;Positive=0.8993;Rank=1;Target=HA%7CCDS%7Chemagglutinin%7Cseg4protA 1 566;gene=HA;gene_biotype=protein_coding
MH201222 miniprot CDS 21 1721 2545 . 0 ID=cds-HA;Identity=0.8233;Parent=gene-HA;Rank=1;Target=HA%7CCDS%7Chemagglutinin%7Cseg4protA 1 566;gene=HA;product=hemagglutinin
MH201222 feature signal_peptide_region_of_CDS 21 71 . + . ID=signal_peptide-HA;Parent=cds-HA,gene-HA
MH201222 miniprot mature_protein_region_of_CDS 72 1052 1413 + 0 ID=mature_protein-HA;Identity=0.7737;Parent=cds-HA,gene-HA;Rank=1;Target=HA%7Cmature_protein_region_of_CDS%7CHA1%7Cseg4matureA2 1 327;product=HA1
MH201222 miniprot mature_protein_region_of_CDS 1053 1718 1109 + 0 ID=mature_protein-HA;Identity=0.9279;Parent=cds-HA,gene-HA;Rank=1;Target=HA%7Cmature_protein_region_of_CDS%7CHA2%7Cseg4matureA3 1 222;product=HA2
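The signal peptide and mature peptide rows relate nucleotide coordinates to residue positions in the HA protein. As a worked example, the sketch below maps each nucleotide range to residue numbers, assuming an unspliced forward-strand CDS that starts at nucleotide 21; nt_range_to_aa is a hypothetical helper.

```python
def nt_range_to_aa(cds_start, nt_start, nt_end):
    """Map a nucleotide range inside an unspliced, forward-strand CDS
    to 1-based (first, last) residue numbers."""
    if nt_start < cds_start or nt_end < nt_start:
        raise ValueError("range must lie inside the CDS and be ordered")
    first = (nt_start - cds_start) // 3 + 1
    last = (nt_end - cds_start) // 3 + 1
    return first, last


HA_CDS_START = 21  # from the HA CDS row (21..1721)
for name, start, end in [
    ("signal peptide", 21, 71),
    ("HA1", 72, 1052),
    ("HA2", 1053, 1718),
]:
    print(name, nt_range_to_aa(HA_CDS_START, start, end))
# signal peptide (1, 17)
# HA1 (18, 344)
# HA2 (345, 566)
```

These residue ranges agree with the NCBI feature IDs (1..17, 18..344, 345..566) and with the Target lengths of 327 and 222 residues reported for HA1 and HA2.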
Segment 5
Influenza Virus Sequence Annotation Tool output
>Feature MH085254
44 1540 gene
gene NP
44 1540 CDS
product nucleocapsid protein
protein_id MH085254p1
gene NP
INFO: Length: 1561 nucleotides
INFO: Segment: 5 (NP)
INFO: Sequence completeness: protein 1 - complete; nucleotide - complete
INFO: Virus type: influenza A
NCBI Genbank GFF for MH085254.1
gfflu GFF
##gff-version 3
##sequence-region MH085254 1 1540
MH085254 miniprot gene 44 1540 2469 + . ID=gene-NP;Identity=0.9438;Name=NP;Positive=0.9819;Rank=1;Target=NP%7CCDS%7Cnucleocapsid_protein%7Cseg5prot 1 498;gene=NP;gene_biotype=protein_coding
MH085254 miniprot CDS 44 1540 2469 . 0 ID=cds-NP;Identity=0.9438;Parent=gene-NP;Rank=1;Target=NP%7CCDS%7Cnucleocapsid_protein%7Cseg5prot 1 498;gene=NP;product=nucleocapsid protein
Segment 6
Influenza Virus Sequence Annotation Tool output
>Feature EF190976
21 1385 gene
gene NA
21 1385 CDS
product neuraminidase
protein_id EF190976p1
gene NA
INFO: Length: 1413 nucleotides
INFO: Segment: 6 (NA)
INFO: Sequence completeness: protein 1 - complete; nucleotide - complete
INFO: Serotype: N1
INFO: Virus type: influenza A
gfflu GFF
##gff-version 3
##sequence-region EF190976 1 1385
EF190976 miniprot gene 21 1385 2231 + . ID=gene-NA;Identity=0.8681;Name=NA;Positive=0.9149;Rank=1;Target=NA%7CCDS%7Cneuraminidase%7Cseg6prot1A 1 470;gene=NA;gene_biotype=protein_coding
EF190976 miniprot CDS 21 1385 2231 . 0 ID=cds-NA;Identity=0.8681;Parent=gene-NA;Rank=1;Target=NA%7CCDS%7Cneuraminidase%7Cseg6prot1A 1 470;gene=NA;product=neuraminidase
Segment 7
Influenza Virus Sequence Annotation Tool output
>Feature MH085255
24 782 gene
gene M1
24 782 CDS
product matrix protein 1
protein_id MH085255p1
gene M1
24 1005 gene
gene M2
24 49 CDS
738 1005
product matrix protein 2
protein_id MH085255p2
gene M2
INFO: Length: 1023 nucleotides
INFO: Segment: 7 (MP)
INFO: Sequence completeness: protein 1 - complete; protein 2 - complete; nucleotide - complete
INFO: This sequence (MH085255) contains following signature mutation(s) that might confer amantadine resistance: (V27A) (S31N)
INFO: Virus type: influenza A
gfflu GFF
##gff-version 3
##sequence-region MH085255 1 1005
MH085255 miniprot gene 24 782 1238 + . ID=gene-M1;Identity=0.9683;Name=M1;Positive=0.9881;Rank=1;Target=M1%7CCDS%7Cmatrix_protein_1%7Cseg7prot1 1 252;gene=M1;gene_biotype=protein_coding
MH085255 miniprot CDS 24 782 1238 . 0 ID=cds-M1;Identity=0.9683;Parent=gene-M1;Rank=1;Target=M1%7CCDS%7Cmatrix_protein_1%7Cseg7prot1 1 252;gene=M1;product=matrix protein 1
MH085255 miniprot gene 24 1005 435 + . ID=gene-M2;Identity=0.8454;Name=M2;Positive=0.9072;Rank=1;Target=M2%7CCDS%7Cmatrix_protein_2%7Cseg7prot2A 1 97;gene=M2;gene_biotype=protein_coding
MH085255 miniprot CDS 24 49 41 + 0 ID=cds-M2;Identity=1.0000;Parent=gene-M2;Rank=1;Target=M2%7CCDS%7Cmatrix_protein_2%7Cseg7prot2A 1 8;gene=M2;product=matrix protein 2
MH085255 miniprot CDS 738 1005 394 . 1 ID=cds-M2;Identity=0.8295;Parent=gene-M2;Rank=1;Target=M2%7CCDS%7Cmatrix_protein_2%7Cseg7prot2A 9 97;gene=M2;product=matrix protein 2
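The phase column of the two M2 CDS rows (0, then 1) follows from the length of the first exon. The sketch below shows the standard GFF3 phase calculation for a forward-strand spliced CDS; it is an illustration of the arithmetic, not gfflu's code, and cds_phases is a hypothetical name.

```python
def cds_phases(intervals):
    """GFF3 phase for each forward-strand CDS segment, in transcript order.

    The phase is the number of bases to skip at the start of a segment
    to reach the first base of the next complete codon.
    """
    phases = []
    consumed = 0  # coding nucleotides in the preceding segments
    for start, end in intervals:
        phases.append((3 - consumed % 3) % 3)
        consumed += end - start + 1
    return phases


print(cds_phases([(24, 49), (738, 1005)]))  # M2: [0, 1]
print(cds_phases([(25, 54), (527, 862)]))   # NS2 (Segment 8): [0, 0]
```

The first M2 exon is 26 nucleotides long (8 codons plus 2 bases), so the second exon needs 1 base to complete the split codon, hence phase 1. The first NS2 exon is 30 nucleotides, a whole number of codons, so its second exon has phase 0.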
Segment 8
Influenza Virus Sequence Annotation Tool output
>Feature MH085256
25 717 gene
gene NS1
25 717 CDS
product nonstructural protein 1
protein_id MH085256p1
gene NS1
25 862 gene
gene NEP
gene_syn NS2
25 54 CDS
527 862
product nuclear export protein
note nonstructural protein 2
protein_id MH085256p2
gene NEP
INFO: Length: 886 nucleotides
INFO: Segment: 8 (NS)
INFO: Sequence completeness: protein 1 - complete; protein 2 - complete; nucleotide - complete
INFO: Virus type: influenza A
gfflu GFF
##gff-version 3
##sequence-region MH085256 1 862
MH085256 miniprot gene 25 862 553 + . ID=gene-NS2;Identity=0.9174;Name=NS2;Positive=0.9339;Rank=1;Target=NS2%7CCDS%7Cnonstructural_protein_2%7Cseg8prot2A 1 121;gene=NS2;gene_biotype=protein_coding
MH085256 miniprot CDS 25 54 44 + 0 ID=cds-NS2;Identity=0.9000;Parent=gene-NS2;Rank=1;Target=NS2%7CCDS%7Cnonstructural_protein_2%7Cseg8prot2A 1 10;gene=NS2;product=nonstructural protein 2
MH085256 miniprot CDS 527 862 509 . 0 ID=cds-NS2;Identity=0.9189;Parent=gene-NS2;Rank=1;Target=NS2%7CCDS%7Cnonstructural_protein_2%7Cseg8prot2A 11 121;gene=NS2;product=nonstructural protein 2
MH085256 miniprot gene 25 717 1074 + . ID=gene-NS1;Identity=0.9130;Name=NS1;Positive=0.9565;Rank=1;Target=NS1%7CCDS%7Cnonstructural_protein_1%7Cseg8prot1J 1 230;gene=NS1;gene_biotype=protein_coding
MH085256 miniprot CDS 25 717 1074 . 0 ID=cds-NS1;Identity=0.9130;Parent=gene-NS1;Rank=1;Target=NS1%7CCDS%7Cnonstructural_protein_1%7Cseg8prot1J 1 230;gene=NS1;product=nonstructural protein 1
License
gfflu is distributed under the terms of the MIT license.