A tool to annotate microbial genomes

Introduction

Genotate is a tool to annotate prokaryotic* and phage genomes. It slides amino-acid windows along all six reading frames and classifies each window as belonging either to a protein-coding gene region or to a noncoding region, in order to determine the coding frame at every position along the genome. (*The bacterial/archaeal model is still being trained.)
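As an illustration of what "amino-acid windows in all six frames" means, here is a minimal sketch in plain Python. It is not Genotate's own code: the function names, the window size, and the use of the standard genetic code are assumptions made for the example, and no classifier is included.

```python
from itertools import product

# Standard genetic code in NCBI "TCAG" order (an assumption for this sketch;
# Genotate itself determines the translation table from the genome).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_windows(dna, size=3):
    """Yield (frame, codon_index, window) for every amino-acid window.

    Frames are numbered +1..+3 on the forward strand and -1..-3 on the
    reverse strand. `size` is a hypothetical window length in amino acids.
    """
    dna = dna.upper()
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for i in range(len(protein) - size + 1):
                yield strand * (offset + 1), i, protein[i:i + size]


if __name__ == "__main__":
    for frame, index, window in six_frame_windows("ATGGCCTAA", size=2):
        print(frame, index, window)
```

Each yielded window is the kind of unit a classifier could label as coding or noncoding; how Genotate encodes and scores its windows is not shown here.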

To install Genotate:

 pip install genotate

To run Genotate, you only need to specify the FASTA-formatted genome file. The command to run the phage models on the provided phiX174 genome is:

 genotate.py test/phiX174.fasta -o predictions.gb

Running the partially trained bacterial/archaeal models requires the --bacteria flag. Instead of a FASTA-formatted file, you can provide a GenBank-formatted file, and Genotate will use only the genomic sequence:

 genotate.py test/mycoplasma.gbff.gz -o predictions.gb --bacteria

A GPU is recommended, since Genotate can take a long time to run on prokaryotic genomes. Genotate automatically tries to run on a GPU; if none is found, it runs on the CPU.


The output of Genotate is a set of 'coding region' predictions in GenBank format. They should match the true coding gene regions but are not genes per se, since they are not based on start and stop codons. They have, however, all been trimmed to a stop codon after Genotate determines which translation table the genome uses (i.e. whether it performs stop codon readthrough).

There are three main phases to the Genotate workflow:

  1. window classification
  2. change-point detection
  3. refinement
    • analyze stop codons
    • merge adjacent regions
    • split regions on stop
    • adjust ends to a stop

Genotate determines the translation table by analyzing the initial coding gene region predictions. There are two outcomes for a stop codon that is read through: either the stop codon appears in the middle of a predicted coding gene region, or the region is broken into two pieces at the stop codon. If one of the three known stop codons is significantly overrepresented both in the middle of AND between predicted gene regions, that stop codon can be assumed to be read through. With the stop codon usage now known, adjacent coding regions in the same frame are merged if there is no stop codon between them. The regions are then split on any internal stop codons, and their ends are adjusted to the nearest stop codon.**
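The merge, split, and end-adjustment steps can be sketched as follows. This is a simplified illustration under stated assumptions, not Genotate's implementation: regions are hypothetical (start, end) half-open nucleotide coordinates that all share one forward-strand frame, "nearest stop" is taken to be the next downstream in-frame stop, and the statistical test for readthrough is left out (the caller just passes the set of stop codons considered active).

```python
def codons(seq, start, end):
    """Return the complete in-frame codons in seq[start:end]."""
    return [seq[i:i + 3] for i in range(start, end - 2, 3)]


def merge_adjacent(seq, regions, stops):
    """Merge consecutive same-frame regions with no stop codon between them."""
    merged = []
    for start, end in sorted(regions):
        gap = codons(seq, merged[-1][1], start) if merged else []
        if merged and not any(c in stops for c in gap):
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged


def split_on_stops(seq, regions, stops):
    """Split each region directly after every stop codon it contains."""
    pieces = []
    for start, end in regions:
        piece_start = start
        for i in range(start, end - 2, 3):
            if seq[i:i + 3] in stops:
                pieces.append((piece_start, i + 3))
                piece_start = i + 3
        if piece_start < end:
            pieces.append((piece_start, end))
    return pieces


def extend_to_stop(seq, region, stops):
    """Move the 3' end to the next downstream in-frame stop, if one exists."""
    start, end = region
    if end - start >= 3 and seq[end - 3:end] in stops:
        return region  # already ends on a stop codon
    for i in range(end, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return (start, i + 3)
    return region  # no stop found; leave the end untouched


def refine(seq, regions, stops):
    """Apply merge, split, and end adjustment in that order."""
    merged = merge_adjacent(seq, regions, stops)
    pieces = split_on_stops(seq, merged, stops)
    return [extend_to_stop(seq, r, stops) for r in pieces]


if __name__ == "__main__":
    seq = "ATGAAACCCGGGTAAATGTTTTGA"
    predicted = [(0, 6), (9, 12), (15, 21)]
    print(refine(seq, predicted, {"TAA", "TAG", "TGA"}))
```

Dropping a codon from `stops` (for example passing only `{"TAA", "TAG"}`) models a genome in which that codon is read through, so regions are neither split nor trimmed at it.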

** The opposite end is not adjusted to a valid start codon, since Genotate does not yet have a translation initiation site detection method, so the beginning of a gene call may be off by a few codons.

Currently the best way to visualize the predictions is in a genome viewer application, such as Artemis from Sanger. Loading the example phiX174.gb GenBank file into Artemis shows the gene layout.

The predictions.gb file can then be loaded using the 'File>Read An Entry' menu, and the predictions will be overlaid as grey 'coding regions' in the gene layout window.
