Translate DNA sequences to protein sequences using different genetic codes and translation tables

Environment
- Console
- MacOS X
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- POSIX :: Linux
Programming Language
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

Genetic Codes

A pure Python library with no imports for translating DNA sequences into protein sequences using different translation tables (aka genetic codes).

The NCBI Genetic Codes are central to working with alternate genetic codes. This Python toolkit includes a library that exposes the genetic codes, so you can query a codon to get its variants or query a code to get its table.
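To make the idea of "query a codon and get its variants" concrete, here is a minimal sketch that does not use pygenetic_code at all. It assumes the NCBI convention of writing each code as a 64-letter amino acid string with codons ordered T, C, A, G at every position, and it includes only three tables (1, 4 and 15) for illustration. The names `NCBI_AAS`, `code_table` and `codon_variants` are hypothetical and are not part of this library's API.

```python
from itertools import product

# Codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ... GGG
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino acid strings in the NCBI layout (illustrative subset of tables)
NCBI_AAS = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    15: "FFLLSSSSYY*QCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def code_table(table_id):
    """Return a codon -> amino acid dict for one translation table."""
    return dict(zip(CODONS, NCBI_AAS[table_id]))


def codon_variants(codon):
    """Return the amino acid a codon encodes in each table we know about."""
    return {table_id: code_table(table_id)[codon] for table_id in NCBI_AAS}


print(codon_variants("TGA"))  # {1: '*', 4: 'W', 15: '*'}
print(codon_variants("TAG"))  # {1: '*', 4: '*', 15: 'Q'}
```

A dict per table keeps lookups simple; the library itself may store the codes differently.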

Installation

You can install pygenetic_code with pip.

pip install pygenetic_code
pygenetic_code --version

A conda installation is coming.

Usage

Translating sequences

We have some example applications that show you how to translate DNA sequences in all six reading frames.

First, make sure you have a DNA sequence. We provide a few in tests/, including a very short sequence, crAssphage, and [E. coli](tests/U00096.3.fna.gz).

Then, you can use the example code to translate that sequence using the bacterial genetic code (translation table 11):

python examples/translate_sequence_in_all_frames.py -f tests/JQ995537.fna -t 11

or an alternate genetic code (translation table 15):

python examples/translate_sequence_in_all_frames.py -f tests/JQ995537.fna -t 15

We have also included the E. coli K-12 sequence, so you can identify all the ORFs in that genome:

python examples/translate_sequence_in_all_frames.py -f tests/U00096.3.fna.gz -t 11

(Yes, you can use gzip-compressed files without decompressing them.)

The translation itself takes about 0.1 seconds, but starting Python and the other overheads bring the total run time to almost 0.75 seconds.
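If you want to read plain or gzip-compressed FASTA files in your own scripts, one common approach is to check the two-byte gzip magic number and pick the opener accordingly. The sketch below uses only the standard library. The function names `open_maybe_gzip` and `read_fasta` are hypothetical, and this is not necessarily how the example scripts read their input.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, transparently handling gzip compression."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)
```

For example, `for header, seq in read_fasta("tests/U00096.3.fna.gz"): ...` would iterate over the records without a separate decompression step.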

You can also look at the effect of translation tables on the same sequences by running:

python examples/average_translation_length.py -f tests/JQ995537.fna # for crassphage
python examples/average_translation_length.py -f tests/U00096.3.fna.gz # for E. coli K-12
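To see why the translation table changes the length of translated stretches, here is a small self-contained sketch. It assumes that "translation length" means the length of a stop-free stretch, which may not be exactly what examples/average_translation_length.py computes. The DNA string is made up for illustration, only tables 1 and 4 are included, and the function names are hypothetical.

```python
from itertools import product

CODONS = ["".join(c) for c in product("TCAG", repeat=3)]
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def translate(dna, table_id):
    """Translate one frame, starting at the first base; unknown codons become X."""
    code = dict(zip(CODONS, TABLES[table_id]))
    dna = dna.upper()
    return "".join(code.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def mean_stop_free_length(protein):
    """Average length of the stretches between stop codons (*)."""
    stretches = [s for s in protein.split("*") if s]
    return sum(len(s) for s in stretches) / len(stretches) if stretches else 0.0


dna = "ATGAAATGACTGTGGTAA"  # made-up sequence containing an in-frame TGA
for table_id in TABLES:
    print(table_id, mean_stop_free_length(translate(dna, table_id)))
# 1 2.0   (TGA is a stop, so the translation is split in two)
# 4 5.0   (TGA encodes tryptophan, so the translation reads through)
```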

Library

Translating sequences

You can import the C library by importing PyGeneticCode.

There are two main methods that you can call:

The first function returns the translation of your DNA sequence in a single frame, read 5' -> 3'. For example, this is the method you might use to translate an ORF.

PyGeneticCode.translate_one_frame(DNA_sequence, translation_table, verbose)

(See examples/translate_asequence.py for an example.)

The second method returns the translations of all six reading frames.

PyGeneticCode.translate(DNA_sequence, translation_table, verbose)

(See examples/translate_sequence_in_all_frames.py for an example invocation.)

DNA_sequence is the sequence you want to translate. translation_table must be one of the valid translation tables (see pygenetic_code/genetic_code.translation_tables for the valid tables).
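For readers unfamiliar with six-frame translation, the following pure Python sketch shows the idea: three offsets on the forward strand and three on the reverse complement. It is not the C implementation behind PyGeneticCode, it supports only the standard code (translation table 1), and the frame labels 1, 2, 3, -1, -2, -3 are an assumption made for this example rather than the library's return format.

```python
from itertools import product

CODONS = ["".join(c) for c in product("TCAG", repeat=3)]
STANDARD = dict(zip(
    CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons become X."""
    return "".join(
        STANDARD.get(dna[i:i + 3], "X") for i in range(offset, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame label to protein sequence."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate_frame(dna, offset)    # forward strand
        frames[-(offset + 1)] = translate_frame(rc, offset)  # reverse strand
    return frames


frames = six_frames("ATGGCCTAA")
print(frames[1])   # MA*
print(frames[-1])  # LGH
```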

Translate a codon

Another way to use the library in your Python application is to call the translate_codon() function, which has this signature:

amino_acid = translate_codon(codon, translation_table=1, one_letter=False)

codon is the codon that you want to translate, given as either an RNA (e.g. AUG) or DNA (e.g. ATG) sequence. translation_table is the translation table to use (see the NCBI website for valid tables). one_letter controls the output format: when False, the function returns a three-letter amino acid code (e.g. Met or Ter); when True, it returns a one-letter code (e.g. M or *).
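The behaviour described above can be sketched in a few lines of plain Python. This is a stand-in written for illustration, not the library's own translate_codon(): the name `translate_codon_sketch` is hypothetical, it handles only the standard code (translation table 1), and raising ValueError for invalid input is an assumption.

```python
from itertools import product

CODONS = ["".join(c) for c in product("TCAG", repeat=3)]
STANDARD = dict(zip(
    CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def translate_codon_sketch(codon, one_letter=False):
    """Translate one DNA or RNA codon using the standard code only."""
    dna_codon = codon.upper().replace("U", "T")  # accept RNA input
    if dna_codon not in STANDARD:
        raise ValueError(f"Not a valid codon: {codon!r}")
    amino_acid = STANDARD[dna_codon]
    return amino_acid if one_letter else THREE_LETTER[amino_acid]


print(translate_codon_sketch("AUG"))                   # Met
print(translate_codon_sketch("ATG", one_letter=True))  # M
print(translate_codon_sketch("UGA"))                   # Ter
```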

The library provides other ways to access the genetic codes; these are exemplified in the pytest files in tests/.

Standalone

You can print translation tables using the pygenetic_code command. There are currently a few options:

- json prints the table in machine-readable JSON format.
- difference prints a .tsv file with the differences from the standard code (translation table 1).
- maxdifference prints a .tsv file with the differences from the most common amino acid. The main difference is that TGA more frequently encodes tryptophan than a stop.

Citing

Please cite this repository as:

Edwards, Robert A. 2023. pygenetic_code. https://github.com/linsalrob/genetic_codes

A full DOI citation is coming soon.

Release history

0.20.0

Jan 4, 2024

0.16.0 (this version)

Jan 3, 2024

0.14

Jan 2, 2024

0.13

Dec 24, 2023

0.12

Dec 24, 2023

0.1

Dec 24, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pygenetic_code-0.16.0.tar.gz (1.5 MB)

Uploaded Jan 3, 2024 Source

Built Distribution

pygenetic_code-0.16.0-cp312-cp312-manylinux_2_35_x86_64.whl (24.9 kB)

Uploaded Jan 3, 2024 CPython 3.12 manylinux: glibc 2.35+ x86-64

Hashes for pygenetic_code-0.16.0.tar.gz

Algorithm	Hash digest
SHA256	`7ae6b17beeb92474022e67ac33264277af7c15fa45aa5e5eed6eb85ca1e0c9ab`
MD5	`cd828f2eae75fa965371a5877cca7b74`
BLAKE2b-256	`0235313be3eee9901c3f33c6a181ea66780b7d8a3c9e45c94335765f20e7d708`

Hashes for pygenetic_code-0.16.0-cp312-cp312-manylinux_2_35_x86_64.whl

Algorithm	Hash digest
SHA256	`baa8cfe80f67777103b6b65d12d185df0288d3e592f5c1b3387d1a0e235d0948`
MD5	`4ac112155ec70b6a104cba183fd32746`
BLAKE2b-256	`96568807d2e99754f6fb3b42e7f222645de44b3236175932883b81034e5187bc`