A tool to annotate microbial genomes
Introduction
Genotate is a tool to annotate prokaryotic* and phage genomes. It slides amino-acid windows along all six reading frames and classifies each window as belonging either to a protein-coding gene region or to a noncoding region, in order to determine the coding frame at every position along the genome.
*(The bacteria/archaea model is still being trained.)
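As a rough illustration of what "amino-acid windows in all six frames" means, here is a minimal sketch in plain Python. It is not Genotate's code: the function names, the window length of 5, and the use of the standard codon table are all assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# example ignores alternative translation tables).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_windows(genome, window=5):
    """Yield (frame, codon_index, peptide) for every window in all six frames.

    Frames are labelled +1, +2, +3 for the forward strand and -1, -2, -3
    for the reverse strand. `window` is a hypothetical window length.
    """
    genome = genome.upper()
    for strand, seq in ((1, genome), (-1, reverse_complement(genome))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for i in range(len(protein) - window + 1):
                yield strand * (offset + 1), i, protein[i:i + window]


windows = list(six_frame_windows("ATGGCCAAATAG", window=3))
```

In a real classifier each peptide window would be turned into features and scored as coding or noncoding; that step is left out here.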
To install Genotate:
pip install genotate
To run Genotate, you only need to specify the FASTA-formatted genome file.
The command to run using the phage models on the provided phiX174 genome is:
genotate.py test/phiX174.fasta -o predictions.gb
The command to run using the partially trained bacterial/archaeal models needs the --bacterial flag. Instead of a FASTA-formatted file, you can provide a GenBank-formatted file, and Genotate will use only the genomic sequence.
genotate.py test/mycoplasma.gbff.gz -o predictions.gb --bacteria
It is recommended to run Genotate on a GPU, since prokaryotic genomes take a long time to process. Genotate automatically tries to run on a GPU; if one isn't found, it runs on the CPU.
The output of Genotate is a set of 'coding region' predictions in GenBank format. They should match the true coding gene regions, but they are not genes per se, since they are not based on start and stop codons. They have, however, all been trimmed to a stop codon after Genotate determines which translation table the genome uses (i.e. whether it performs stop codon readthrough).
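To make "trimmed to a stop codon" concrete, the sketch below moves the 3' end of a forward-strand region to the next in-frame stop codon. It is only an illustration under stated assumptions: `extend_to_stop` is a hypothetical helper, coordinates are 0-based and half-open, and the search direction (downstream only) is a guess rather than Genotate's documented behaviour.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_stop(seq, start, end, stops=STOP_CODONS):
    """Return a new end for the region [start, end) that includes the first
    in-frame stop codon at or after the region's last codon.

    If no in-frame stop is found before the sequence ends, the original
    (codon-aligned) end is returned unchanged.
    """
    seq = seq.upper()
    # Snap the end to a codon boundary relative to the region start.
    end = start + ((end - start) // 3) * 3
    pos = max(start, end - 3)
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in stops:
            return pos + 3
        pos += 3
    return end


seq = "ATGAAACCCGGGTAACCC"  # codons: ATG AAA CCC GGG TAA CCC
new_end = extend_to_stop(seq, 0, 6)
# If TAA were treated as read through, it is dropped from the stop set:
readthrough_end = extend_to_stop(seq, 0, 6, stops={"TAG", "TGA"})
```

Passing a reduced `stops` set is how a readthrough codon would be ignored in this sketch.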
There are three main phases to the Genotate workflow:
- window classification
- change-point detection
- refinement
  - analyze stop codons
  - merge adjacent regions
  - split regions on stop
  - adjust ends to a stop
Genotate determines the translation table by analyzing the initial coding gene region predictions. There are two possible outcomes for a stop codon that is read through: either the stop codon appears in the middle of a predicted coding region, or the region is broken into two pieces at the stop codon. If one of the three known stop codons is significantly overrepresented both in the middle of AND between predicted gene regions, that stop codon can be assumed to be read through. With the stop codon usage now known, adjacent coding regions in the same frame are merged if there is no stop codon between them. The regions are then split on any internal stop codons, and their ends are adjusted to the nearest stop codon.
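The refinement steps described above can be sketched in a few lines of Python. This is a simplified illustration, not Genotate's implementation: the helper names are hypothetical, regions are 0-based half-open `(start, end)` tuples that are assumed to be codon-aligned and to come from a single frame on the forward strand, and the significance test for overrepresentation is left out.

```python
from collections import Counter

ALL_STOPS = {"TAA", "TAG", "TGA"}


def count_internal_stops(seq, regions, stops=ALL_STOPS):
    """Count in-frame stop codons inside regions, excluding each last codon."""
    counts = Counter()
    for start, end in regions:
        for i in range(start, end - 3, 3):
            if seq[i:i + 3] in stops:
                counts[seq[i:i + 3]] += 1
    return counts


def has_stop(seq, start, end, stops):
    """True if an in-frame stop codon lies within [start, end)."""
    return any(seq[i:i + 3] in stops for i in range(start, end - 2, 3))


def merge_regions(seq, regions, stops):
    """Merge neighbouring same-frame regions with no stop codon between them."""
    merged = []
    for start, end in sorted(regions):
        if merged:
            prev_start, prev_end = merged[-1]
            same_frame = (start - prev_start) % 3 == 0
            if same_frame and not has_stop(seq, prev_end, start, stops):
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


def split_on_stops(seq, start, end, stops):
    """Split [start, end) after every in-frame stop codon it contains."""
    pieces, piece_start = [], start
    for i in range(start, end - 2, 3):
        if seq[i:i + 3] in stops:
            pieces.append((piece_start, i + 3))
            piece_start = i + 3
    if piece_start < end:
        pieces.append((piece_start, end))
    return pieces


seq = "ATGAAATAGCCCGGGTAAATGCCC"  # ATG AAA TAG CCC GGG TAA ATG CCC
regions = [(0, 6), (9, 15)]
kept_apart = merge_regions(seq, regions, ALL_STOPS)
joined = merge_regions(seq, regions, {"TAA", "TGA"})  # TAG treated as read through
internal = count_internal_stops(seq, joined)
pieces = split_on_stops(seq, 0, len(seq), {"TAA", "TGA"})
```

With all three stop codons active, the TAG between the two regions keeps them apart; once TAG is treated as read through, the regions merge and TAG shows up as an internal stop.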
** The opposite end is not adjusted to a valid start codon, since Genotate does not yet have a translation initiation site detection method, so the beginning of a gene call may be off by a few codons.
Currently the best way to visualize the predictions is in a Genome Viewer application, such as Artemis by Sanger. The example phiX174.gb GenBank file loaded into Artemis shows the gene layout:
The predictions.gb file can then be loaded using the 'File>Read An Entry' menu, and the predictions will be overlaid as grey 'coding regions' in the gene layout window: