A Python tool to extract and align genes, introns, and intergenic spacers across hundreds of plastid genomes using associative arrays

plastburstalign

A Python tool to extract and align genes, introns, and intergenic spacers across thousands of plastid genomes using associative arrays

Purpose

This software tool is designed for large-scale quality assessment of organellar genome annotations. It detects annotation discrepancies in large genome datasets by comparing sequence and annotation features through automated multiple sequence alignments across homologous regions.

Background

The multiple sequence alignment (MSA) of a set of plastid genomes is challenging. At least five factors contribute to this challenge:

First, the plastid genome is a mosaic of individual genome regions. An MSA procedure must identify, extract, group, and align homologous regions across the input genomes.
Second, many plastid genomes contain annotation errors in gene positions and/or gene names. An MSA procedure must automatically exclude incorrectly annotated regions from the alignment procedure.
Third, plastid genomes comprise both coding and noncoding regions, which require different alignment strategies (e.g., amino acid-based for genes, nucleotide-based for introns and intergenic spacers). An MSA procedure must apply the appropriate strategy automatically.
Fourth, modern plastid genome studies often involve hundreds, if not thousands, of complete genomes. An MSA procedure must perform sequence alignment within practical time frames (e.g., hours rather than days).
Fifth, manually excluding user-specified genome regions after alignment is prohibitively complex. An MSA procedure must support the automatic exclusion of user-specified regions before the alignment starts.

The software plastburstalign addresses these and other challenges: it provides an MSA procedure that extracts and aligns genes, introns, and intergenic spacers across hundreds or thousands of plastid genomes in an autonomous fashion.

Overview of process

Depiction of plastomes being split according to specified marker type; the extracted sequences are then aligned and concatenated

Main features

Extraction of all genome regions from a set of input plastid genomes, followed by grouping and alignment of the extracted regions:
- genes (cds)
- introns (int)
- intergenic spacers (igs)
Support for multiple alignment tools:
- MAFFT
- MUSCLE
- Clustal Omega
Automatic exon splicing:
- automatic merging of all exons of any cis-spliced gene
- automatic grouping of all exons of any trans-spliced gene (e.g., rps12), followed by merging of adjacent exons [see ExonSpliceHandler for both]
Automatic quality control to evaluate whether extracted genes are complete (i.e., valid start and stop codons are present)
Automatic removal of duplicate regions (relevant for regions duplicated through the inverted repeats, IRs)
Removal of user-specified regions:
- regions that do not fulfill a minimum, user-specified sequence length
- regions that do not fulfill a minimum, user-specified number of taxa of the dataset that the region must be found in [see DataCleaning for both]
- any user-specified genome region (i.e., gene, intron, or intergenic spacer)
Automatic determination of whether a DNA sequence alignment is based on amino acid (for genes) or nucleotide (for introns and intergenic spacers) sequence information
Parallelized processing for faster extraction and alignment across multiple CPUs
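The completeness check and the duplicate removal listed above can be pictured with a minimal sketch. It is not the package's actual implementation: the function names are hypothetical, and treating ATG as the only valid start codon is an assumption made for this example.

```python
# Hypothetical helpers for illustration; names and codon sets are assumptions.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq):
    """Return True if the sequence starts with a start codon and ends with a stop codon."""
    seq = seq.upper()
    if len(seq) < 6:
        return False
    return seq[:3] in START_CODONS and seq[-3:] in STOP_CODONS


def remove_duplicates(regions):
    """Keep only the first occurrence of each region name within one genome."""
    unique = {}
    for name, seq in regions:
        unique.setdefault(name, seq)
    return unique


# Regions extracted from a single genome; rpl2 occurs once per inverted repeat.
regions = [
    ("rpl2", "ATGGCGTAA"),
    ("ndhB", "ATGGCGTAA"),
    ("rpl2", "ATGGCGTAA"),  # second copy from the other inverted repeat
    ("matK", "GCGGCGTAA"),  # lacks a start codon
]
unique = remove_duplicates(regions)
complete = {name: seq for name, seq in unique.items() if is_complete_cds(seq)}
print(sorted(complete))
```

A dictionary keyed by region name makes the duplicate check a constant-time lookup, which matters when hundreds of regions are extracted per genome.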

Additional features

Concatenation of all genome region alignments, either in alphabetical order or based on their location in the genome (first input genome used as reference)
Automatic standardization of tRNA gene names to accommodate letter case differences among the gene annotations of different input genomes (e.g., for anticodon and amino acid abbreviations of tRNAs) [see clean_gene()]
Flexible configuration of alignment tools via user-defined parameters
Production of informative logs; two detail levels:
- default (suitable for regular software execution)
- verbose (suitable for debugging)
Clear reporting when genome regions cannot be extracted
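The two concatenation orders mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the function name `concatenate` is hypothetical, and padding taxa that lack a region with gap characters is an assumption of this example.

```python
def concatenate(alignments, order=None):
    """Concatenate per-region alignments into one sequence per taxon.

    alignments: dict mapping region name -> dict of taxon -> aligned sequence
    order: list of region names (e.g., as they occur in the first input genome);
           if None, regions are concatenated in alphabetical order.
    """
    if order is None:
        region_names = sorted(alignments)
    else:
        region_names = [name for name in order if name in alignments]
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for region in region_names:
        aln = alignments[region]
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            # Assumption: a taxon missing from a region is padded with gaps.
            parts[taxon].append(aln.get(taxon, "-" * length))
    return {taxon: "".join(seqs) for taxon, seqs in parts.items()}


alignments = {
    "rbcL": {"A": "ATG-", "B": "ATGC"},
    "atpB": {"A": "TT", "B": "T-"},
    "matK": {"A": "GGG"},
}
alphabetical = concatenate(alignments)
by_location = concatenate(alignments, order=["rbcL", "atpB", "matK"])
print(alphabetical)
print(by_location)
```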

Input/output

Input

Set of complete plastid genomes (each in GenBank flatfile format)

Output

DNA sequence alignments of individual genome regions (FASTA format)
Concatenation of all individual DNA sequence alignments (FASTA and NEXUS format)

Installation on Linux (Debian)

# Install the latest release from PyPI
pip install plastburstalign

# Alternatively, install directly from GitHub
#pip install git+https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign.git
#pip3 install git+ssh://git@github.com/michaelgruenstaeudl/PlastomeBurstAndAlign.git

You must manually install at least one of the supported alignment tools (MAFFT, MUSCLE, Clustal Omega). If the executable is not on your PATH, provide the path to the tool in the input.
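To check which alignment tools are reachable through PATH before running the software, a short standard-library sketch such as the following may help. It assumes the executables are named `mafft`, `muscle`, and `clustalo`; the helper `find_aligners` is hypothetical and not part of plastburstalign.

```python
import shutil

# Assumed executable names of the three supported alignment tools.
ALIGNERS = {"MAFFT": "mafft", "MUSCLE": "muscle", "Clustal Omega": "clustalo"}


def find_aligners():
    """Return a dict mapping each tool name to its full path, or None if not on PATH."""
    return {tool: shutil.which(exe) for tool, exe in ALIGNERS.items()}


found = find_aligners()
for tool, path in found.items():
    print(f"{tool}: {path or 'not found on PATH'}")
```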

Installation of external alignment tools

Option 1 — Conda environment (recommended)

git clone https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign.git
cd PlastomeBurstAndAlign
conda env create -f environment.yml
conda activate plastburstalign

This environment installs Python dependencies and external alignment tools (MAFFT, MUSCLE, Clustal Omega).

Option 2 — Manual installation

sudo apt install mafft muscle clustalo

Usage

Command-line (recommended)

After installation via pip, run:

plastburstalign

Example run

plastburstalign \
  -i Input_dataset \
  -o Output_dataset \
  -s cds \
  -a mafft
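The same run can also be launched from a Python script, for example as part of a larger pipeline. The sketch below only assembles and prints the command; it uses the flags documented in this README, while the helper names `build_command` and `run` are hypothetical.

```python
import shlex
import subprocess


def build_command(input_dir, output_dir, select="cds", aligner="mafft",
                  min_length=None, min_taxa=None, threads=None):
    """Assemble the plastburstalign command line as a list of arguments."""
    cmd = ["plastburstalign", "-i", str(input_dir), "-o", str(output_dir),
           "-s", select, "-a", aligner]
    # Optional flags are appended only when a value is given.
    for flag, value in (("-l", min_length), ("-t", min_taxa), ("-n", threads)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


cmd = build_command("Input_dataset", "Output_dataset", threads=8)
print(shlex.join(cmd))  # call run(cmd) to start the analysis
```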

Parameters overview

| Option | Description | Example |
|---|---|---|
| `-i` | Input dataset directory | `Input_dataset` |
| `-o` | Output directory | `Output_dataset` |
| `-s` | Sequence type to extract (e.g., cds, int, igs) | `cds` |
| `-a` | Alignment tool to use | `mafft` |
| `-l` | Minimum sequence length (bp); regions shorter than this are excluded | `9` |
| `-t` | Minimum number of taxa (int) or relative frequency (0<value<=1) in which a region must be present to be extracted | `0.1` |
| `-n` | Number of threads to use | `8` |
| `--config` | Path to YAML config file containing parameters | `config.yaml` |
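How the `-l` and `-t` thresholds might interact can be illustrated with a small sketch. It is not the package's implementation. Two assumptions are made here: an integer is read as an absolute number of taxa and a float in (0, 1] as a relative frequency, and the length filter is applied to each sequence individually. All function names are hypothetical.

```python
import math


def resolve_min_taxa(threshold, n_genomes):
    """Translate the taxa threshold into an absolute number of taxa."""
    if isinstance(threshold, int):
        return threshold
    if 0 < threshold <= 1:
        # Rounding guards against floating-point noise before taking the ceiling.
        return math.ceil(round(threshold * n_genomes, 9))
    raise ValueError("threshold must be an int or a float in (0, 1]")


def filter_regions(regions, n_genomes, min_length, min_taxa):
    """Drop short sequences, then drop regions present in too few taxa.

    regions: dict mapping region name -> dict of taxon -> sequence
    """
    required = resolve_min_taxa(min_taxa, n_genomes)
    kept = {}
    for name, seqs in regions.items():
        long_enough = {t: s for t, s in seqs.items() if len(s) >= min_length}
        if len(long_enough) >= required:
            kept[name] = long_enough
    return kept


regions = {
    "rbcL": {t: "ATGAAATTTGGG" for t in ("A", "B", "C", "D")},
    "psbA": {"A": "ATGAAATTTGGG"},                # present in one taxon only
    "petN": {t: "ATGTAA" for t in ("A", "B", "C")},  # shorter than 9 bp
}
kept = filter_regions(regions, n_genomes=4, min_length=9, min_taxa=0.5)
print(sorted(kept))
```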

Source usage

If you cloned the repository:

git clone https://github.com/michaelgruenstaeudl/PlastomeBurstAndAlign.git
cd PlastomeBurstAndAlign
python -m plastburstalign

Python API

You can also use the package directly in Python:

from plastburstalign import PlastomeBurstAndAlign

burst = PlastomeBurstAndAlign()
burst.execute()

Usage of individual package components

Individual components can be used as well. For example, the classes for the alignment tools can be instantiated by themselves, optionally with a configuration (e.g., an instance of MAFFT that will execute with 4 threads, and instances of MUSCLE and Clustal Omega with default settings):

from plastburstalign import MAFFT, MUSCLE, ClustalOmega

mafft = MAFFT({"num_threads": 4})
muscle = MUSCLE()
clustal = ClustalOmega()

Details on exon splicing

The plastid genome is a mosaic of individual genome regions, with many of its genes consisting of multiple exons. To align genes based on their amino acid sequence information, all exons of a gene must be extracted and concatenated prior to alignment. plastburstalign conducts this exon splicing through an automated process that differentiates between cis- and trans-spliced genes: the exons of cis-spliced genes are adjacent to each other, whereas those of trans-spliced genes are not. The software concatenates the exons of any cis-spliced gene in place (i.e., no repositioning of the exons is necessary). The exons of any trans-spliced gene (e.g., rps12), by contrast, undergo a two-step repositioning procedure before being concatenated. First, groups of contiguous exons are formed based on their location information: if an exon is adjacent to, or overlaps with, another exon of the same gene name, the two are merged. Second, all exons of the same gene name are merged at the location of the first exon occurrence.
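The first step of this procedure, merging adjacent or overlapping exons, is a standard interval merge and can be sketched as below. This is a simplified illustration and not the code of ExonSpliceHandler: exon locations are assumed to be 0-based, end-exclusive coordinates on the forward strand, exons are joined in positional order, and the function names are hypothetical.

```python
def merge_contiguous(exons):
    """Merge exons that are adjacent to or overlap with each other.

    exons: list of (start, end) tuples, 0-based and end-exclusive
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Adjacent or overlapping: extend the previous group.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def splice(genome, exons):
    """Return the location of the first exon and the concatenated exon sequence."""
    groups = merge_contiguous(exons)
    sequence = "".join(genome[start:end] for start, end in groups)
    return groups[0][0], sequence


genome = "AAAACCCCGGGGTTTTAAAACCCC"
exons = [(16, 20), (0, 4), (4, 8)]  # two adjacent exons and one distant exon
groups = merge_contiguous(exons)
location, spliced = splice(genome, exons)
print(groups, location, spliced)
```

Strand information, which matters for real trans-spliced genes such as rps12, is deliberately left out of this sketch.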

Details on removal of user-specified genome regions from alignment

Due to the size and complexity of large DNA sequence alignments, individual genome regions can hardly be removed from a concatenated sequence alignment; instead, any user-specified exclusion of a genome region must be performed before the actual sequence alignment. plastburstalign provides two command-line parameters for such an exclusion: exclude_region excludes any user-specified region from the dataset by exact name match; exclude_fullcds removes entire user-specified genes from the dataset, together with any introns inside them and any intergenic spacers immediately adjacent to them.
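The difference between the two exclusion modes can be made concrete with a minimal sketch. The `Region` record, its fields, and the intron name `rps16_intron1` are hypothetical constructs for this example; only the parameter names exclude_region and exclude_fullcds come from the software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str     # region name, e.g., "rps16" or "trnK_UUU_rps16"
    kind: str     # "cds", "int", or "igs"
    genes: tuple  # gene(s) the region belongs to or lies between


def exclude_region(regions, names):
    """Remove regions whose name matches one of the given names exactly."""
    names = set(names)
    return [r for r in regions if r.name not in names]


def exclude_fullcds(regions, genes):
    """Remove the given genes plus their introns and adjacent intergenic spacers."""
    genes = set(genes)
    return [r for r in regions if not genes.intersection(r.genes)]


regions = [
    Region("trnK_UUU_rps16", "igs", ("trnK_UUU", "rps16")),
    Region("rps16", "cds", ("rps16",)),
    Region("rps16_intron1", "int", ("rps16",)),
    Region("rps16_trnQ_UUG", "igs", ("rps16", "trnQ_UUG")),
    Region("trnQ_UUG_psbK", "igs", ("trnQ_UUG", "psbK")),
    Region("psbK", "cds", ("psbK",)),
]
by_name = exclude_region(regions, ["rps16"])
by_gene = exclude_fullcds(regions, ["rps16"])
print([r.name for r in by_name])
print([r.name for r in by_gene])
```

Storing the associated genes explicitly avoids having to parse them back out of underscore-separated region names, which would be ambiguous for names such as trnK_UUU_rps16.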

Details on automatic standardization of tRNA gene names

The names of all tRNAs are automatically standardized across the input genomes to counteract the accumulation of idiosyncratic gene names. tRNAs are often labeled differently by different researchers. For example, researcher A may label tRNAs with both amino acid abbreviations and anticodons (e.g., trnL-Leu-UAA), whereas researcher B may label them with the respective anticodons only (e.g., trnL-UAA). Similarly, researcher C may label tRNA genes with lower-case anticodons (e.g., trnL-uaa) but researcher D with upper-case anticodons (e.g., trnL-UAA). Differences in tRNA gene names may also originate from the idiosyncratic use of dashes versus underscores (e.g., trnL-UAA versus trnL_UAA). Leaving the names of homologous tRNA genes unadjusted and, thus, incongruent across different input genomes risks an artificial increase in the number of unique genes, introns, and intergenic spacers in the dataset.

To ensure that only homologous genes are grouped together and aligned, plastburstalign automatically standardizes tRNA gene names across input genomes. Specifically, the software homogenizes incongruent tRNA gene names to a single format: tRNAabbreviation_anticodon (e.g., trnL_UAA). This format is (i) the most commonly used tRNA naming scheme among plastid genomes and (ii) the least problematic scheme for nucleotide sequence alignment operations, which typically interpret dashes as gap characters. During the standardization operations, plastburstalign utilizes the three-letter amino acid abbreviations and the anticodon definitions of translation table 11 of the International Nucleotide Sequence Database Collaboration (INSDC). tRNAs with more than one possible anticodon for which neither an amino acid abbreviation nor an anticodon is given in the gene name (e.g., trnL can be any of the following: trnL_UAA, trnL_CAA, trnL_AAG, trnL_GAG, trnL_UAG, and trnL_CAG), by contrast, are not changed by plastburstalign to avoid incorrect designations.

As a side effect, the automatic standardization of tRNA gene names also decreases the number of annotated genome regions that need to be removed from the dataset for not reaching the minimum number of taxa defined. Without the standardization, the intergenic spacer between the genes trnL_CAA and ndhB, for example, may be grouped under two different names (e.g., trnL_CAA_ndhB and trnL_caa_ndhB), with the latter group being less common and eventually removed from the dataset for not reaching the minimum number of taxa. With gene name standardization, the same intergenic spacer is grouped under only one name (i.e., trnL_CAA_ndhB) and not discarded. Preliminary tests indicated that the tRNA gene name standardization decreased the number of annotated genome regions removed for not reaching the minimum number of taxa by approximately 25%.
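The effect of the name standardization on grouping can be illustrated with a simplified sketch that uses the leucine tRNA genes trnL-UAA and trnL-CAA as examples. It is not the package's clean_gene() function and performs no lookup in translation table 11: it only unifies separators and letter case and drops a three-letter amino acid abbreviation when an anticodon is present. The function name `standardize_trna_name` is hypothetical.

```python
import re

# An anticodon is assumed to consist of exactly three nucleotide letters.
ANTICODON = re.compile(r"^[ACGTU]{3}$", re.IGNORECASE)


def standardize_trna_name(name):
    """Return the name in the form trnX_ANTICODON, or unchanged if that is not possible."""
    tokens = re.split(r"[-_]", name.strip())
    gene = tokens[0]
    if not gene.lower().startswith("trn") or len(tokens) < 2:
        return name  # not a tRNA gene, or no anticodon given: leave unchanged
    anticodon = tokens[-1]
    if not ANTICODON.match(anticodon):
        return name  # last token is not an anticodon: leave unchanged
    return f"{gene}_{anticodon.upper()}"


# Group the same intergenic spacer from two genomes in an associative array.
spacers_by_genome = {
    "genome_1": ("trnL-CAA", "ndhB"),
    "genome_2": ("trnL_caa", "ndhB"),
}
groups = {}
for genome, (left, right) in spacers_by_genome.items():
    key = f"{standardize_trna_name(left)}_{standardize_trna_name(right)}"
    groups.setdefault(key, []).append(genome)
print(groups)
```

Without the standardization step, the two genomes would end up under two different keys, and the rarer key could fall below the minimum number of taxa.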

Exemplary usage

See this document

Release history

- 0.9.9: Jun 11, 2026 (this version)
- 0.9.8: May 24, 2026
- 0.9.7: May 23, 2026
- 0.9.6: May 22, 2026
- 0.9.5: May 5, 2026
- 0.9.4: May 1, 2026
- 0.9.3: May 1, 2026
- 0.9.2: May 1, 2026
- 0.9.1: Aug 27, 2024
