StORF-Reporter - A tool that takes as input an annotated genome and returns missed (Stop-to-Stop) CDS genes from the unannotated regions.
Project description
StORF-Reporter - Preprint: https://www.biorxiv.org/content/10.1101/2022.03.31.486628v3
StORF-Reporter, a toolkit that takes as input an annotated genome and returns missed CDS genes from the Unannotated Regions (URs).
Please use `pip3 install StORF-Reporter' to install StORF-Reporter.
This will also install numpy and the 'ORForise' package from https://github.com/NickJD/ORForise to allow for additional functionality.
Consider using '--no-cache-dir' with pip to ensure the download of the newest version of the package.
Please Note: To report Con-StORFs (Pseudogenes and genes that reuse stop codons), use "-con_storfs True" and "-con_only" to disable reporting of StORFs (missed complete genes).
#############################################################
UR-Extractor
Subpackage to extract Unannotated Regions from DNA sequences using FASTA and GFF files as input.
Menu - (UR-Extractor -h):
UR-Extractor -f genomes/E-coli.fasta.gz -gff genomes/E-coli.gff -o genomes/E-coli_UR -gz True
usage: UR_Extractor.py [-h] -f FASTA -gff GFF [-ident IDENT] [-min_len MINLEN] [-max_len MAXLEN] [-ex_len EXLEN] [-gene_ident GENE_IDENT] [-o OUT_FILE] [-gz {True,False}] [-v {True,False}]
StORF-Reporter v0.6.0: UR-Extractor Run Parameters.
Required Arguments:
-f FASTA FASTA file for Unannotated Region seq extraction
-gff GFF GFF annotation file for the FASTA
Optional Arguments:
-ident IDENT Identifier given for Unannotated Region output sequences - Do not modify if output is to be used by StORF-Finder: Default "Sequence-ID"_UR
-min_len MINLEN Minimum UR Length: Default 30
-max_len MAXLEN Maximum UR Length: Default 100,000
-ex_len EXLEN UR Extension Length on 5' and 3': Default 50
-gene_ident GENE_IDENT
Identifier used for extraction of Unannotated Regions "CDS,rRNA,tRNA": Default for Ensembl_Bacteria = "ID=gene" or "-gene_ident CDS" for "most" genome annotations
Output:
-o OUT_FILE Output file - Without filetype - default appends "_UR" to end of input gff filename (replaces '.gff')
-gz {True,False} Default - False: Output as .gz
Misc:
-v {True,False} Default - False: Print out runtime status
StORF-Finder
Subpackage to extract Stop - Stop Codon (St)ORFs from Fasta sequences - Worked directly with the output of UR-Extractor.
Menu - (StORF-Finder -h):
StORF-Finder -f genomes/E-coli_UR.fasta.gz -o genomes/E-coli_UR_StORF
usage: StORF_Finder.py [-h] -f FASTA [-ua {True,False}] [-wc {True,False}] [-ps {True,False}] [-olap_filt [{none,single-strand,both-strand}]] [-start_filt {True,False}] [-con_storfs {True,False}] [-con_only {True,False}]
[-short_storfs {False,Nolap,Olap}] [-short_storfs_only {True,False}] [-stop_ident {True,False}] [-type [{StORF,CDS,ORF}]] [-minorf MIN_ORF] [-maxorf MAX_ORF] [-codons STOP_CODONS] [-olap OVERLAP_NT] [-s SUFFIX]
[-so [{start_pos,strand}]] [-spos {True,False}] [-o OUT_FILE] [-af AFFIX] [-gff {True,False}] [-aa {True,False}] [-aa_only {True,False}] [-lw {True,False}] [-gff_fasta {True,False}] [-gz {True,False}]
[-v {True,False}]
StORF-Reporter v0.6.0: StORF-Finder Run Parameters.
Required Arguments:
-f FASTA Input FASTA File - (UR_Extractor output)
Optional Arguments:
-ua {True,False} Default - Treat input as Unannotated: Use "-ua False" for standard fasta
-wc {True,False} Default - False: StORFs reported across entire sequence
-ps {True,False} Default - False: Partial StORFs reported
-olap_filt [{none,single-strand,both-strand}]
Default - "both-strand": Filtering level "none" is not recommended, "single-strand" for single strand filtering and both-strand for both-strand longest-first tiling
-start_filt {True,False}
Default - False: Filter out StORFs without at least one of the 3 common start codons (best used for short-storfs).
-con_storfs {True,False}
Default - False: Output Consecutive StORFs
-con_only {True,False}
Default - False: Only output Consecutive StORFs
-short_storfs {False,Nolap,Olap}
Default - False: Run StORF-Finder in "Short-StORF" mode. Will only return StORFs between 30 and 120 nt that do not overlap longer StORFs - Only works with StORFs for now. "Nolap" will filter Short-StORFs which
areoverlapped by StORFs and Olap will report Short-StORFs which do overlap StORFs. Overlap is defined by "-olap".
-short_storfs_only {True,False}
Default - True. Only report Short-StORFs?
-stop_ident {True,False}
Default - True: Identify Stop Codon positions with '*'
-type [{StORF,CDS,ORF}]
Default - "StORF": Which GFF feature type for StORFs to be reported as in GFF
-minorf MIN_ORF Default - 100: Minimum StORF size in nt
-maxorf MAX_ORF Default - 50kb: Maximum StORF size in nt
-codons STOP_CODONS Default - ('TAG,TGA,TAA'): List Stop Codons to use
-olap OVERLAP_NT Default - 50: Maximum number of nt of a StORF which can overlap another StORF.
-s SUFFIX Default - Do not append suffix to genome ID
-so [{start_pos,strand}]
Default - Start Position: How should StORFs be ordered when >1 reported in a single UR.
-spos {True,False} Default - False: Print out StORF positions inclusive of first stop codon
-o OUT_FILE Default - False: Without filetype - default appends '_StORF-R' to end of input gff filename (replaces '.gff')
-af AFFIX Default - None: '-af Con-StORFs' can be used to append an identifier to output filename to distinguish Con-StORF from StORF runs)
Output:
-gff {True,False} Default - True: Output a GFF file
-aa {True,False} Default - False: Report StORFs as amino acid sequences
-aa_only {True,False}
Default - False: Only output Amino Acid Fasta
-lw {True,False} Default - True: Line wrap FASTA sequence output at 60 chars
-gff_fasta {True,False}
Default - False: Report all gene sequences (nt) at the bottom of GFF files in PROKKA output mode
-gz {True,False} Default - False: Output as .gz
Misc:
-v {True,False} Default - False: Print out runtime status
##########
StORF-Reporter
Subpackage to run UR-Extractor and StORF-Finder together on a PROKKA annotation and produce output which can be used in tools such as Roary that by default accepts PROKKA's output format:
For use on a single PROKKA output directory -
StORF-Reporter -anno PROKKA -pd ../PROKKA_04062022/
For use on a directory containing multiple PROKKA output gffs
StORF-Reporter -anno PROKKA -p_gff ../PROKKA_Outputs/
Menu - (StORF-Reporter -h):
usage: StORF_Reporter.py [-h] -anno [{PROKKA,Ensembl,Gene}] -dt [{PROKKA_Out,PROKKA_GFFs}] -pd PROKKA_DIR [-comb COMBINED_GFFS] [-spos {True,False}] [-rs {True,False}] [-sout {True,False}] [-con_storfs {True,False}]
[-con_only {True,False}] [-short_storfs {False,Nolap,Olap}] [-short_storfs_only {True,False}] [-min_len MINLEN] [-max_len MAXLEN] [-ex_len EXLEN] [-minorf MIN_ORF] [-olap_filt [{none,single-strand,both-strand}]]
[-start_filt {True,False}] [-type [{StORF,CDS,ORF}]] [-olap OVERLAP_NT] [-ao ALLOWED_OVERLAP] [-lw {True,False}] [-aa {True,False}] [-gff_fasta {True,False}] [-gz {True,False}] [-v {True,False}]
StORF-Reporter v0.6.0: StORF-Reporter Run Parameters.
Required Arguments:
-anno [{PROKKA,Ensembl,Gene}]
Annotation type to be StORF-Reported - Options: PROKKA = "misc_RNA,gene,mRNA,CDS,tRNA,tmRNA,CRISPR";Ensembl = "ID=gene" ;Gene = "gene"
-dt [{PROKKA_Out,PROKKA_GFFs}]
Type of directory to be loaded by StORF-Reported: Options: PROKKA_Out = Single PROKKA output directory; PROKKA_GFFs = Directory containing multiple PROKKA GFF files
-pd PROKKA_DIR PROKKA output directory to be used - Produces a new GFF and FASTA containing all Coding and Non-Coding Seqs for each PROKKA output GFF
Optional Arguments:
-comb COMBINED_GFFS Provide directory containing GFFs with sequences combined into single file to be StORFed - Only produces modified GFFs
-spos {True,False} Default - False: Print out StORF positions inclusive of first stop codon
-rs {True,False} Default - True: Remove stop "*" from StORF amino acid sequences
-sout {True,False} Default - False: Print out StORF sequences separately?
-con_storfs {True,False}
Default - False: Output Consecutive StORFs
-con_only {True,False}
Default - False: Only output Consecutive StORFs
-short_storfs {False,Nolap,Olap}
Default - False: Run StORF-Finder in "Short-StORF" mode. Will only return StORFs between 30 and 120 nt that do not overlap longer StORFs - Only works with StORFs for now. "Nolap" will filter Short-StORFs which
areoverlapped by StORFs and Olap will report Short-StORFs which do overlap StORFs. Overlap is defined by "-olap".
-short_storfs_only {True,False}
Default - True. Only report Short-StORFs?
-min_len MINLEN Default - 30: Minimum UR Length
-max_len MAXLEN Default - 100,000: Maximum UR Length
-ex_len EXLEN Default - 50: UR Extension Length
-minorf MIN_ORF Default - 100: Minimum StORF size in nt
-olap_filt [{none,single-strand,both-strand}]
Default - "both-strand": Filtering level "none" is not recommended, "single-strand" for single strand filtering and both-strand for both-strand longest-first tiling
-start_filt {True,False}
Default - False: Filter out StORFs without at least one of the 3 common start codons (best used for short-storfs).
-type [{StORF,CDS,ORF}]
Default - "CDS": Which GFF feature type for StORFs to be reported as in GFF - "CDS" is probably needed for use in tools such as Roary
-olap OVERLAP_NT Default - 50: Maximum number of nt of a StORF which can overlap another StORF.
-ao ALLOWED_OVERLAP Default - 50 nt: Maximum overlap between a StORF and an original gene.
Output::
-lw {True,False} Default - True: Line wrap FASTA sequence output at 60 chars
-aa {True,False} Default - False: Report StORFs as amino acid sequences
-gff_fasta {True,False}
Default - False: Report all gene sequences (nt) at the bottom of GFF files in PROKKA output mode
-gz {True,False} Default - False: Output as .gz
Misc::
-v {True,False} Default - False: Print out runtime status
Test Datasets:
The direcotory 'Genomes' contains GFF and FASTA datasets to test a users installation and use of StORF-Reporter. Example output files are also available for comparison.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
StORF-Reporter-0.6.0.tar.gz
(39.0 kB
view hashes)
Built Distribution
Close
Hashes for StORF_Reporter-0.6.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 09538d92cf92d8f86fd1b0a1130f4df188a27ad8ccffe122c7b7f237ca10ba6f |
|
MD5 | d7ec4fb327caf4dd5d34412d12b6502c |
|
BLAKE2b-256 | 4f61bf6d3d7f3a9b6aec460fc0f4d35ed9f6c07adbf90c6fcbb03eaf00280c9a |