Translation-guided nucleotide alignment for coding sequences

Translation-guided alignment of nucleotide sequences

Several established tools for translation-guided codon alignment, e.g. TranslatorX and pal2nal, are no longer maintained or available for download. Others, e.g. transAlign, may need to be ported to newer language or dependency versions to work properly.

This package reimplements some features of the above programs to perform simple translation-guided nucleotide (codon) alignments, and to screen for pseudogenes with frameshift indels or nonsense substitutions.

The tool can be used to perform alignment or simply report sequence statistics and flag potential pseudogenes. The intended use case is to screen and align collections of PCR-amplified coding sequences used for metabarcoding, e.g. the mitochondrial cytochrome c oxidase subunit I (mtCOI) gene fragment.

How the alignment works

The reading frame can be specified manually or guessed with a heuristic. The genetic code must be specified manually; a heuristic to guess the genetic code is not yet implemented.

Choose reading frames for translation:

The reading frame can be chosen in one of three ways, set with the --how option:
- User-defined frame offset applied to all sequences (--how user)
- Apply same frame to all sequences, choose consensus frame that minimizes total number of stop codons across all sequences (--how cons)
- Choose frame individually for each sequence that minimizes stop codons for that sequence; may result in ties where a sequence may have more than one 'best' reading frame (--how each)
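The stop-codon heuristic behind --how cons and --how each can be sketched in a few lines of Python. This is an illustration only, not the package's own code: the function names `count_stops`, `frames_each` and `frame_cons` are hypothetical, and the stop codon set assumes the standard genetic code, whereas the real tool uses whichever genetic code the user specifies.

```python
# Stop codons of the standard genetic code (NCBI table 1). Other codes differ,
# e.g. table 5 (invertebrate mitochondrial) uses only TAA and TAG.
STOPS = frozenset({"TAA", "TAG", "TGA"})


def count_stops(seq, offset, stops=STOPS):
    """Count in-frame stop codons when reading seq from offset 0, 1 or 2."""
    seq = seq.upper()
    # Step through complete codons only; a trailing partial codon is ignored.
    return sum(seq[i:i + 3] in stops for i in range(offset, len(seq) - 2, 3))


def frames_each(seqs, stops=STOPS):
    """Per sequence: every offset with the fewest stop codons (ties are kept)."""
    best = {}
    for name, seq in seqs.items():
        counts = [count_stops(seq, f, stops) for f in range(3)]
        best[name] = [f for f in range(3) if counts[f] == min(counts)]
    return best


def frame_cons(seqs, stops=STOPS):
    """One offset for all sequences, minimizing the total stop codon count."""
    totals = [sum(count_stops(s, f, stops) for s in seqs.values())
              for f in range(3)]
    return totals.index(min(totals))


# Hypothetical toy input: name -> nucleotide sequence
seqs = {"a": "ATGAAACCCGGG", "c": "GCTAAGCTGAGC"}
print(frames_each(seqs))  # sequence "a" has a tie between two frames
print(frame_cons(seqs))
```

The tie for sequence "a" shows why the per-sequence mode needs a further tie-breaking step, while the consensus mode pools evidence across all sequences.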
Sequences that have more than the maximum allowed number of stop codons in any reading frame are flagged as putative pseudogenes.
The 'good' sequences are translated in the reading frame as chosen above.
If there is more than one reading frame with zero stop codons, the two (or three) alternative translations are each pairwise aligned to the remaining sequences with an unambiguous best reading frame. The frame that has the highest total alignment score is chosen.
Optional: If an HMM representing the target protein sequence is provided (option --hmm), the 'good' sequences are screened against this HMM; sequences with outlier bit scores are not used for the initial alignment.
Translated 'good' sequences are aligned with MAFFT; nucleotide sequences are then aligned as codons using this amino acid alignment
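The last step, turning an amino acid alignment back into a codon alignment, can be illustrated as follows. This is a minimal sketch under stated assumptions, not the package's implementation: `thread_codons` is a hypothetical helper, and it assumes each aligned protein row is the exact translation of the nucleotide sequence starting at the given frame offset.

```python
def thread_codons(aligned_protein, nucleotides, offset=0, gap="-"):
    """Build one codon alignment row from one aligned protein row.

    aligned_protein: amino acid row with gap characters, as written by an aligner.
    nucleotides: the unaligned coding sequence that was translated.
    offset: reading frame offset (0, 1 or 2) used for the translation.
    """
    codons = []
    pos = offset
    for residue in aligned_protein:
        if residue == gap:
            # One gap in the protein row becomes three gaps in the codon row.
            codons.append(gap * 3)
        else:
            codon = nucleotides[pos:pos + 3]
            if len(codon) < 3:
                raise ValueError("protein row is longer than the translated nucleotides")
            codons.append(codon)
            pos += 3
    return "".join(codons)


print(thread_codons("M-K", "ATGAAA"))                # ATG---AAA
print(thread_codons("MK-", "CATGAAAGG", offset=1))   # ATGAAA---
```

In the second call the leading base and the two trailing bases fall outside complete codons and do not appear in the output row.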

Dealing with pseudogenes/frameshifted sequences (adapted from transAlign, see Bininda-Emonds, 2005):

Nucleotide sequences of putative pseudogenes are then aligned against the reference 'good' alignment using the MAFFT --add option
Likely frameshift positions in putative pseudogenes are reported from the positional map of the reference-guided alignment
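One way to think about locating frameshifts in a reference-guided alignment is to look for indel runs whose length is not a multiple of three. The sketch below illustrates that general idea and is not a description of how the package builds its positional map: `frameshift_candidates` is a hypothetical function that compares a single reference row with a single added row from the same alignment.

```python
def frameshift_candidates(ref_row, query_row, gap="-"):
    """Return (kind, start_column, length) for indel runs not divisible by 3.

    Columns are 0-based. Terminal gaps in the query are ignored because they
    usually reflect a shorter sequence rather than a deletion. Columns where
    both rows are gaps end a run; a fuller version would skip over them.
    """
    if len(ref_row) != len(query_row):
        raise ValueError("rows must come from the same alignment")
    first = len(query_row) - len(query_row.lstrip(gap))  # first query base
    last = len(query_row.rstrip(gap))                    # one past last query base
    events = []
    run_kind, run_start = None, first
    for col in range(first, last + 1):
        if col == last:
            kind = None  # sentinel column that flushes the final run
        elif ref_row[col] == gap and query_row[col] != gap:
            kind = "insertion"
        elif ref_row[col] != gap and query_row[col] == gap:
            kind = "deletion"
        else:
            kind = None
        if kind != run_kind:
            run_len = col - run_start
            if run_kind is not None and run_len % 3:
                events.append((run_kind, run_start, run_len))
            run_kind, run_start = kind, col
    return events


print(frameshift_candidates("ATGAAACCCGGG", "ATGA-ACCCGGG"))
# [('deletion', 4, 1)]
```

An insertion or deletion of exactly three bases keeps the reading frame and is therefore not reported by this sketch.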

Assumptions

Input sequences are homologous
Input sequences are protein coding sequences without introns or untranslated regions
Input sequences are long enough that wrong reading frame will be evident in excessive stop codons (warning if average sequence length is under 50)
If pseudogenes are present, majority of sequences are not pseudogenes (warning if more than half of sequences have excessive stop codons)
Sequences all use the same genetic code

For a more careful alignment, or for sequence sets with many frameshifted sequences, use MACSE instead. However, MACSE is quite slow for de novo alignments and is probably overkill for most "normal" datasets where most sequences do not have frameshifts.

Installation

From PyPI

Install from PyPI with pip, preferably into a virtualenv:

python -m venv /path/to/env
source /path/to/env/bin/activate
pip install pytransaln

External dependencies are not installed by pip and must also be available on your PATH:

MAFFT >=6.811; tested with v7.520.
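Before running an alignment it can be useful to confirm that MAFFT is on the PATH and recent enough. The following is a hedged sketch using only the Python standard library: the helper names are hypothetical, and the parser assumes a version string of the form `v7.520 (2023/Mar/22)`, which may not hold for every MAFFT build.

```python
import re
import shutil
import subprocess

MIN_MAFFT = (6, 811)  # minimum version stated above


def parse_mafft_version(text):
    """Extract (major, minor) from a string such as 'v7.520 (2023/Mar/22)'."""
    match = re.search(r"v?(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def mafft_version(executable="mafft"):
    """Return the MAFFT version as a tuple, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # Read both streams, since the version line may be written to stderr.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return parse_mafft_version(result.stderr + result.stdout)


version = mafft_version()
if version is None:
    print("MAFFT not found on PATH, or its version could not be parsed")
elif version < MIN_MAFFT:
    print("MAFFT is older than the minimum version:", version)
else:
    print("MAFFT version OK:", version)
```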

From Bioconda

Using Conda or Mamba, preferably to a new environment:

mamba create -p /path/to/env -c conda-forge -c bioconda -c defaults --strict-channel-priority pytransaln

Refer to the Bioconda documentation for details on channel priority.

Usage

See the help message for details:

pytransaln --help

It is recommended to inspect the alignment afterwards or apply quality checks with other programs such as trimAl.

To view alignments on the command line you can use alv and pipe to less with the -R option:

alv -t codon -l alignment.fasta | less -R

Output alignment

Note that:

Leading and trailing bases not contained in complete codons may be omitted
Portions of the putative pseudogene sequences that were aligned in the second step to the initial 'good' alignment may be omitted from the final 'augmented' alignment to preserve the reading frame. See the frameshift report file for details.

Testing and benchmarking

Commands to run tests with example data (from the benchmark data sets distributed with transAlign) are in the Makefile:

make help # list available commands
make install # install to current path/environment
make benchmark # download test data and run alignments
make clean # delete benchmark output

Future enhancements

In order of priority

Add pre and post frame sequence back to alignment
Guess genetic code
Translate 6 frames
User-supplied input amino acid alignment

Release history

- 0.2.2 (this version): Nov 25, 2025
- 0.2.1: Oct 17, 2023
- 0.2.0: Oct 5, 2023
- 0.1.0: Sep 29, 2023
