A Python tool to extract and align genes, introns, and intergenic spacers across hundreds of plastid genomes using associative arrays

plastburstalign

Installation on Linux (Debian)

# Alignment software
apt install mafft

# Other dependencies
apt install python3-biopython
apt install python3-coloredlogs

Overview of process

Depiction of plastomes being split according to specified marker type; the extracted sequences are then aligned and concatenated

Main features

  • Extracts all sequences from each of three different genome marker types (i.e., genes, introns, or intergenic spacers) from a set of plastid genomes in GenBank flatfile format, groups and aligns homologous extracted sequences, and then saves the alignments to file
  • Saves both the individual alignments and the concatenation of all alignments
  • Automatic removal of any duplicate regions (i.e., relevant for features duplicated through the IRs)
  • Exon splicing operations #1: Automatic merging of all exons of any cis-spliced gene [see functions of class ExonSpliceHandler]
  • Exon splicing operations #2: Automatic grouping of all exons of any trans-spliced gene (e.g., rps12), followed by merging the exons [see functions of class ExonSpliceHandler]
  • Automatic removal of regions that do not fulfill
    • a minimum, user-specified sequence length
    • a minimum, user-specified number of taxa in the dataset that the region must be found in [see function DataCleaning() for both]
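The two removal criteria above can be sketched as follows. This is a minimal illustration only, not the code of DataCleaning(): the function name filter_regions, its parameters, and the dictionary layout (region name mapped to a taxon-to-sequence dictionary) are assumptions made for this example, as is the choice to apply the length filter per sequence before counting taxa.

```python
def filter_regions(regions, min_seq_length, min_num_taxa):
    """Hypothetical helper: drop short sequences, then drop sparse regions."""
    kept = {}
    for name, seqs_by_taxon in regions.items():
        # Keep only the sequences that reach the minimum length
        long_enough = {
            taxon: seq
            for taxon, seq in seqs_by_taxon.items()
            if len(seq) >= min_seq_length
        }
        # Keep the region only if enough taxa still carry it
        if len(long_enough) >= min_num_taxa:
            kept[name] = long_enough
    return kept


# Made-up toy data: region name -> {taxon: sequence}
regions = {
    "psbA": {"taxon1": "ATGACTGCA", "taxon2": "ATGACTGCT", "taxon3": "ATG"},
    "ycf15": {"taxon1": "ATGCTTCTT"},
}
filtered = filter_regions(regions, min_seq_length=5, min_num_taxa=2)
print(sorted(filtered))  # names of the regions that pass both filters
```

In this toy dataset, the short psbA sequence of taxon3 is dropped, and ycf15 is removed entirely because it is present in only one taxon.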

Additional features

  • Rapid sequence extraction and alignment of the genes/introns/intergenic spacers due to process parallelization using multiple CPUs [see internal function _nuc_MSA()]
  • Automatic removal of any user-specified genes/introns/intergenic spacers
  • Choice of
    • the order of concatenation of the aligned genes/introns/intergenic spacers: either the natural order of the first input genome (command-line option seq) or alphabetical order (command-line option alpha)
    • automatic case standardization of gene names to adjust for letter-case differences between gene annotations of different genome records (which is especially relevant for anticodon and amino acid abbreviations of tRNAs); includes the option to remove anticodon and amino acid abbreviations from tRNA gene names altogether [see function clean_gene()]
  • If a gene/intron/intergenic spacer cannot be extracted from a GenBank record, provision of explanation why the extraction failed
  • Availability of two log levels:
    • default (suitable for regular software execution), and
    • verbose (suitable for debugging)
  • Package works out of the box on Unix-like systems due to inclusion of the alignment software executable (MAFFT) into the package.
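The case standardization of tRNA gene names mentioned above can be illustrated with a short sketch. It is not the implementation of clean_gene(): the function name standardize_trna_name, the keep_anticodon parameter, and the regular expression are assumptions for this example, and the sketch strips only the anticodon, leaving names that are not recognised as tRNA genes unchanged.

```python
import re

# Hypothetical pattern: "trn", a one- or two-letter amino acid abbreviation,
# and an optional three-letter anticodon (with or without a separator)
_TRNA = re.compile(r"^trn([a-z]{1,2})(?:[-_]?([acgtu]{3}))?$", re.IGNORECASE)


def standardize_trna_name(name, keep_anticodon=True):
    """Return a tRNA gene name with consistent letter case."""
    match = _TRNA.match(name.strip())
    if match is None:
        return name  # not recognised as a tRNA gene name; left unchanged
    amino_acid, anticodon = match.groups()
    # Lowercase a modifier such as the "f" in trnfM, uppercase the final letter
    amino_acid = amino_acid[:-1].lower() + amino_acid[-1].upper()
    if anticodon and keep_anticodon:
        return f"trn{amino_acid}-{anticodon.upper()}"
    return f"trn{amino_acid}"


for raw in ["trnk-uuu", "TRNFM-cau", "trnH-GUG", "psbA"]:
    print(raw, "->", standardize_trna_name(raw))
```

With names normalized in this way, records that annotate the same tRNA gene as trnK-UUU and trnk-uuu end up under one key when homologous sequences are grouped.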

Usage

Option 1: As a script

If the current working directory is within plastburstalign, execute the package via:

python -m plastburstalign

Option 2: As a module

From within Python, execute the package functions via:

from plastburstalign import PlastomeRegionBurstAndAlign
burst = PlastomeRegionBurstAndAlign()
burst.execute()

Usage of individual package components

Individual components can be used as well. For example, to use the class MAFFT by itself (e.g., instantiate one configuration of MAFFT that will execute with 1 thread and another that will execute with 10 threads), type:

from plastburstalign import MAFFT

mafft_1 = MAFFT()
mafft_10 = MAFFT({"num_threads": 10})

Explanation of exon splicing

As the software iterates over the gene list produced by parsing all input genomes, genes that comprise multiple exons are automatically flagged and treated according to the distance between their exons. Cis-spliced genes comprise only exons that are adjacent to each other, whereas trans-spliced genes comprise one or more exons that are not adjacent to each other. The software merges the exons of any cis-spliced gene in place (i.e., at the location specified by the source GenBank file; no repositioning of the exons is necessary). The exons of any trans-spliced gene (e.g., rps12), by contrast, are repositioned before being merged. Specifically, the software accommodates the fact that GenBank flatfiles list trans-spliced genes out of their natural order along the genome sequence: it repositions the exons of trans-spliced genes so that they become adjacent and then merges them.
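The distinction between cis- and trans-spliced genes can be sketched in a few lines. This is an illustration under stated assumptions, not the logic of ExonSpliceHandler: the function names, the (start, end, strand) tuple layout, the max_gap threshold of 5000 bases, and the reading of "merging" as collapsing the exons into a single span are all hypothetical, and the coordinates are made up.

```python
def classify_splicing(exons, max_gap=5000):
    """Return 'cis' or 'trans' for a list of (start, end, strand) exons.

    Assumption: exons count as adjacent when they share a strand and the
    gap between consecutive exons is at most `max_gap` bases.
    """
    ordered = sorted(exons)
    for (_, prev_end, prev_strand), (next_start, _, next_strand) in zip(
        ordered, ordered[1:]
    ):
        if prev_strand != next_strand or next_start - prev_end > max_gap:
            return "trans"
    return "cis"


def merge_span(exons):
    """Collapse adjacent exons into one (start, end, strand) span."""
    starts, ends, strands = zip(*exons)
    return (min(starts), max(ends), strands[0])


# Made-up coordinates for illustration only
atpF = [(100, 245, -1), (950, 1360, -1)]
rps12 = [(70000, 70114, -1), (98000, 98232, 1), (98770, 98796, 1)]

print(classify_splicing(atpF), classify_splicing(rps12))
print(merge_span(atpF))
```

Here the two atpF exons lie close together on the same strand and can be merged where they are, whereas the first rps12 exon is far away from the other two and on the opposite strand, so the gene would need repositioning first.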

To reposition a trans-spliced gene, all annotations of that gene are first moved from the main gene list to a separate list. Second, the annotations are split into simple location features, one for each contiguous group of exons. Third, the expected location of each of these simple gene features is determined by comparing its end location with the end locations of the gene features in the main gene list. If the feature at its expected location does not overlap with either the preceding or the succeeding gene and differs in name from both, it is inserted directly at that location. Alternatively, if the feature at its expected location is flanked by a gene (strictly adjacent or overlapping) with the same name, the two annotations are merged; this merging applies to both the preceding and the succeeding gene.
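The third step, insertion or merging at the expected location, can be sketched as follows. This is a simplified illustration rather than the package's code: the function name insert_or_merge, the (start, end, name) tuples with half-open coordinates, and the use of bisect on end positions are assumptions for this example. A feature that overlaps a differently named gene is simply inserted here, because that case is not described above.

```python
import bisect


def insert_or_merge(genes, feature):
    """Place one (start, end, name) feature into a list sorted by end position."""
    genes = list(genes)  # work on a copy; the input list is left unchanged
    start, end, name = feature
    # Expected position: first gene whose end is not smaller than the feature's end
    idx = bisect.bisect_left([g[1] for g in genes], end)
    # Preceding gene: same name and touching or overlapping -> merge
    if idx > 0 and genes[idx - 1][2] == name and genes[idx - 1][1] >= start:
        prev_start, prev_end, _ = genes.pop(idx - 1)
        start, end = min(start, prev_start), max(end, prev_end)
        idx -= 1
    # Succeeding gene: same name and touching or overlapping -> merge
    if idx < len(genes) and genes[idx][2] == name and genes[idx][0] <= end:
        next_start, next_end, _ = genes.pop(idx)
        start, end = min(start, next_start), max(end, next_end)
    genes.insert(idx, (start, end, name))
    return genes


# Made-up coordinates for illustration only
main_list = [(0, 100, "rps7"), (150, 380, "rps12"), (400, 900, "ndhB")]
merged = insert_or_merge(main_list, (380, 395, "rps12"))      # touches the preceding rps12
inserted = insert_or_merge(main_list, (1000, 1100, "rps12"))  # no same-name neighbour
print(merged)
print(inserted)
```

In the first call the new feature starts exactly where the preceding rps12 annotation ends, so the two are merged into one annotation; in the second call no flanking gene shares its name, so the feature is inserted as a separate entry.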

Testing

cd benchmarking
# CDS
python test_script_cds.py benchmarking1
python test_script_cds.py benchmarking2
# INT
python test_script_int.py benchmarking1
python test_script_int.py benchmarking2
# IGS
python test_script_igs.py benchmarking1
python test_script_igs.py benchmarking2

Exemplary usage

See this document

Generating more test data

See this document
