AEGIS: Annotation Extraction Genomic Integration Suite

AEGIS is a powerful and flexible Python-based suite for the manipulation, analysis, and integration of genomic annotations. It provides a robust, object-oriented framework for working with genomic data, enabling complex analyses and data transformations with intuitive, high-level commands.

Key Features

  • Object-Oriented Design: AEGIS represents genomic features (genes, transcripts, exons, etc.) as a hierarchical system of custom Python classes, providing a clean and intuitive API for data manipulation.
  • Comprehensive Annotation Handling: Seamlessly parse, process, and export genomic annotations in GFF3 format.
  • Extensible and Modular: The modular design of AEGIS allows for easy extension and integration with other bioinformatics tools and pipelines.
  • Command Line Interface: Run "aegis --help" in a terminal for an up-to-date list of the available commands, and "aegis {command} --help" for help on an individual command. There are 14 commands in total; key functionality includes tidying and reformatting GFF/FASTA files, sequence extraction, summary annotation statistics, merging of annotations, and comparative genomic analyses such as orthology detection and synteny analysis between annotation files associated with different genomes.
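
The two help invocations mentioned above can also be driven from a script. The following is a minimal sketch that only assumes the "aegis" executable is on your PATH; the helper names help_command and show_help are hypothetical and not part of AEGIS.

```python
import shutil
import subprocess


def help_command(subcommand=None):
    """Build the argv list for top-level help or for one subcommand's help."""
    argv = ["aegis"]
    if subcommand is not None:
        argv.append(subcommand)
    argv.append("--help")
    return argv


def show_help(subcommand=None):
    """Return the help text, or None when `aegis` is not installed."""
    if shutil.which("aegis") is None:
        return None
    result = subprocess.run(
        help_command(subcommand), capture_output=True, text=True, check=False
    )
    return result.stdout


print(help_command())           # argv for: aegis --help
print(help_command("extract"))  # argv for: aegis extract --help
```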

The AEGIS Class System

The core of AEGIS is its custom class system, which models the hierarchical nature of genomic annotations. This object-oriented approach provides several key advantages over traditional, line-by-line processing of annotation files:

  • Intuitive Data Representation: Genomic features are not just lines in a file; they are objects with properties and relationships. A Gene object contains Transcript objects, which in turn contain Exon and CDS objects. This makes the code more readable, maintainable, and less error-prone.
  • Data Integrity: The class system enforces data consistency. For example, when a Gene object is updated, all its associated Transcript and sub-feature objects are updated accordingly, ensuring that the annotation remains coherent.
  • Complex Queries and Manipulations: The object-oriented structure allows for complex queries and manipulations that would be difficult to perform with traditional text-based tools. For example, you can easily retrieve all coding transcripts for a specific gene, or calculate the total length of all exons in a given transcript.
  • Code Reusability: The class-based design promotes code reusability. Once you have defined a class for a specific genomic feature, you can reuse it in different parts of your analysis pipeline.
  • Robust Maintenance: Unit tests and continuous integration ensure that the code is reliable and maintainable.
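
To make the idea of a feature hierarchy concrete, here is a small illustrative sketch. The class names echo the ones used in this README, but every attribute and method below is hypothetical: this is not the AEGIS implementation or its API, only a toy model of the same design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exon:
    start: int  # 1-based, inclusive, as in GFF3
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class Transcript:
    id: str
    coding: bool
    exons: List[Exon] = field(default_factory=list)

    def total_exon_length(self) -> int:
        """Sum the lengths of all exons in this transcript."""
        return sum(exon.length for exon in self.exons)


@dataclass
class Gene:
    id: str
    transcripts: List[Transcript] = field(default_factory=list)

    def coding_transcripts(self) -> List[Transcript]:
        """Return only the transcripts flagged as coding."""
        return [t for t in self.transcripts if t.coding]

    def shift(self, offset: int) -> None:
        """Move the gene; every sub-feature moves with it."""
        for transcript in self.transcripts:
            for exon in transcript.exons:
                exon.start += offset
                exon.end += offset


gene = Gene("gene1", [
    Transcript("gene1.t1", True, [Exon(100, 199), Exon(300, 349)]),
    Transcript("gene1.t2", False, [Exon(100, 249)]),
])
print([t.id for t in gene.coding_transcripts()])
print(gene.transcripts[0].total_exon_length())
```

Because the exons live inside the transcripts, which live inside the gene, a single call on the gene (shift here) keeps all coordinates consistent without any text-level bookkeeping.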

Core Classes

  • Genome: Represents a genome, containing a collection of Scaffold objects.
  • Scaffold: Represents a chromosome or scaffold, containing the sequence and a collection of Gene objects.
  • Annotation: The main container for genomic annotations, holding a collection of Gene objects.
  • Gene: Represents a gene, containing one or more Transcript objects.
  • Transcript: Represents a transcript, containing Exon, CDS, and UTR objects.
  • Exon, CDS, UTR, Intron: Represent the sub-features of a transcript.
  • Protein, Promoter: Represent other biological features of interest.
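
The containment described above (genes hold transcripts, transcripts hold sub-features) corresponds to the ID and Parent attributes of GFF3 records. The sketch below shows, under simplifying assumptions, how such a hierarchy could be recovered from GFF3 text with the standard library alone. It assumes every feature carries an ID attribute, and the function names are hypothetical; it is not how AEGIS parses files.

```python
from collections import defaultdict

GFF3_TEXT = """\
##gff-version 3
Chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=gene1
Chr1\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=gene1.t1;Parent=gene1
Chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=gene1.t1.e1;Parent=gene1.t1
Chr1\tsrc\texon\t300\t500\t.\t+\t.\tID=gene1.t1.e2;Parent=gene1.t1
"""


def parse_attributes(column):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(pair.split("=", 1) for pair in column.split(";") if pair)


def build_hierarchy(text):
    """Return (features by ID, child IDs by parent ID) for GFF3 text."""
    features = {}
    children = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
        attributes = parse_attributes(attrs)
        feature_id = attributes["ID"]  # assumption: every feature has an ID
        features[feature_id] = {
            "type": ftype, "seqid": seqid,
            "start": int(start), "end": int(end), "strand": strand,
        }
        # GFF3 allows several comma-separated parents per feature.
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                children[parent].append(feature_id)
    return features, dict(children)


features, children = build_hierarchy(GFF3_TEXT)
print(children)
```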

Installation

You can install and run AEGIS in several ways. Using a container (Docker or Singularity) is the recommended approach as it handles all dependencies automatically.

Using Docker (Recommended)

If you have Docker installed, you can easily pull and run the pre-built AEGIS image from Docker Hub. This image includes AEGIS and all third-party software used for orthology analyses.

1a. Pull the image from Docker Hub:

docker pull tomsbiolab/aegis

1b. Or pull the image from GHCR (GitHub Container Registry):

docker pull ghcr.io/tomsbiolab/aegis

2. Run an AEGIS command: The following command runs aegis extract on a test dataset. The -v flag is crucial as it makes your current directory accessible inside the container.

docker run --rm -ti -v `pwd`:`pwd` -w `pwd` tomsbiolab/aegis aegis extract -f protein test_data/arabidopsis_araport11.gff3 test_data/arabidopsis_tair10.fasta

3. (Optional) Build the image locally: If you want to build the image from the source code in this repository, you can use the provided Dockerfile.

docker build -t aegis-local .

You can then run your local image by replacing tomsbiolab/aegis with aegis-local.

Using Singularity

For high-performance computing (HPC) environments where Docker is not available, Singularity is an excellent alternative.

1. Build the Singularity image from Docker Hub:

singularity build aegis.sif docker://tomsbiolab/aegis

This will create a single aegis.sif file in your current directory.

2. Run an AEGIS command: Use the singularity run command to execute AEGIS. The -B flag mounts your current directory into the container.

singularity run -B `pwd`:`pwd` aegis.sif aegis extract -f protein test_data/arabidopsis_tair10.gff3 test_data/arabidopsis_tair10.fasta
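
If you call the containerised tool from a pipeline script, the Docker and Singularity invocations shown above can be assembled programmatically. This is a minimal sketch that reuses only the flags and image names from this section; the helper container_command is hypothetical and not shipped with AEGIS.

```python
import os
import shlex


def container_command(aegis_args, runtime="docker",
                      image="tomsbiolab/aegis", sif="aegis.sif", workdir=None):
    """Build the argv list that runs `aegis <args>` inside a container."""
    workdir = workdir or os.getcwd()
    bind = f"{workdir}:{workdir}"  # mount the working directory at the same path
    if runtime == "docker":
        prefix = ["docker", "run", "--rm", "-ti", "-v", bind, "-w", workdir, image]
    elif runtime == "singularity":
        prefix = ["singularity", "run", "-B", bind, sif]
    else:
        raise ValueError(f"unknown runtime: {runtime}")
    return prefix + ["aegis"] + list(aegis_args)


args = ["extract", "-f", "protein",
        "test_data/arabidopsis_tair10.gff3", "test_data/arabidopsis_tair10.fasta"]
print(shlex.join(container_command(args, runtime="singularity")))
```

The resulting list can be passed to subprocess.run. Note that Docker's -t flag expects a terminal, so you may need to drop -ti when running from a non-interactive job.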

From PyPI (Python Package Index)

This is the easiest way to install AEGIS. However, some dependencies used by 'aegis orthology' will be missing (such as Liftoff, LiftOn, MCScan, OrthoFinder, Diamond...). If you plan to use 'aegis orthology' you will need them; to avoid installing them yourself, see the Docker and Singularity options above. The latest version on PyPI always matches the latest release.

pip3 install aegis-bio
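
After a pip install you may want to check which third-party tools are already available before trying 'aegis orthology'. The sketch below looks for executables on your PATH. The executable names are assumptions based on how these tools are commonly installed, not something AEGIS documents here, and MCScan is left out because how it is invoked depends on the installation.

```python
import shutil

# Display name -> executable name. The executable names are assumptions.
ORTHOLOGY_TOOLS = {
    "Liftoff": "liftoff",
    "LiftOn": "lifton",
    "OrthoFinder": "orthofinder",
    "Diamond": "diamond",
}


def missing_tools(tools=ORTHOLOGY_TOOLS, which=shutil.which):
    """Return the display names of tools whose executable is not on PATH."""
    return sorted(name for name, exe in tools.items() if which(exe) is None)


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All checked tools were found on PATH.")
```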

From Source

Alternatively, you can install AEGIS directly from the source by cloning the repository and installing the required Python dependencies.

git clone https://github.com/Tomsbiolab/aegis.git
cd aegis
pip install .

# Or for development (editable mode):
pip install -e .

Usage

AEGIS is designed to be used as a library in your Python scripts or directly through the CLI.

CLI commands

All of the commands are called with aegis {subcommand} in a terminal:

  • Native Tools (included in pip install):
    • Extract
      • Extracts all kinds of fasta features from an annotation.
    • Overlap
      • Quantifies the overlap of gene models (and their subfeatures) between any number of GFFs associated with the same genome.
    • Rename
      • Rename gff feature ids.
    • Summary
      • Outputs tabular annotation stats as well as a series of plots.
    • Tidy
      • Cleans annotation files, fixes errors, issues warnings, and provides custom formatting options for extra flexibility/compatibility with third party tools.
    • Tidy-genome
      • Allows removal and/or renaming of genome features.
    • Merge
      • Custom merge of any number of GFFs, preventing ID clashes and controlling redundancy at the same loci.
    • Symbols
      • Adds gene symbols to an annotation file based on tabular input.
    • Motifs
      • Plots frequency of a particular DNA motif (allowing regular expressions) in promoter regions of chosen gene lists, all genome’s genes, and random gene lists.
    • Subset
      • Make all sorts of subsets of an annotation file, select by desired features (coding/non-coding) or even create lite versions for debugging/testing.
    • Prune
      • Removes features based on ID lists (transcript or gene level) and resolves any resulting issues, e.g. removes a gene if all of its transcripts are removed.
    • Reformat
      • Converts between GTF and GFF formats.
    • List
      • Lists gene IDs or transcript IDs from an annotation file, optionally selecting which types of genes/transcripts to include/exclude.
  • Integrative Pipelines (require Docker/Singularity or manual install):
    • Orthology
      • Comprehensive multi-tool comparison of gene IDs from different genomes. All evidence is summarised and converted to a qualitative scale, allowing orthologues to be selected by confidence level.
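
As an illustration of the kind of consistency rule described for Prune (a gene is dropped once all of its transcripts are removed), here is a toy sketch on plain dictionaries. It is not the AEGIS implementation; the function name and the data layout are hypothetical.

```python
def prune(genes, transcript_ids_to_remove):
    """Remove transcripts by ID and drop genes left without transcripts.

    `genes` maps a gene ID to its list of transcript IDs. A new mapping is
    returned; the input is left untouched.
    """
    remove = set(transcript_ids_to_remove)
    pruned = {}
    for gene_id, transcripts in genes.items():
        kept = [t for t in transcripts if t not in remove]
        if kept:  # a gene with no remaining transcripts is removed as well
            pruned[gene_id] = kept
    return pruned


genes = {"g1": ["g1.t1", "g1.t2"], "g2": ["g2.t1"]}
print(prune(genes, ["g1.t2", "g2.t1"]))
```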

As a Python library

Here is a simple example of how to load an annotation and extract the sequences of all genes:

from aegis.annotation import Annotation
from aegis.genome import Genome

# Load the genome and annotation
genome = Genome(name="my_genome", genome_file_path="path/to/genome.fasta")
annotation = Annotation(name="my_annotation", annot_file_path="path/to/annotation.gff3", genome=genome)

# Generate and export gene sequences
annotation.export.genes()

Documentation

For further and more detailed information on how to use the AEGIS package, including Jupyter Notebook examples, please refer to the GitHub Wiki. The wiki provides comprehensive guides and tutorials to help you get the most out of the suite.

Link to Wiki: https://github.com/Tomsbiolab/aegis/wiki

Citation

If you use AEGIS in your research, please cite the following preprint:

Navarro-Payá, D., Santiago, A., Velt, A., Moretto, M., Rustenholz, C., & Matus, J. T. (2025). AEGIS: an annotation extraction and genomic integration resource. bioRxiv. doi: 10.64898/2025.12.04.692274v1

License

AEGIS is licensed under the GNU General Public License v3.0. See the LICENSE.md file for more details. Third-party tools included in the Docker image are distributed under their respective licenses.

Download files

Source Distribution

aegis_bio-0.3.0.tar.gz (144.2 kB)

Built Distribution

aegis_bio-0.3.0-py3-none-any.whl (135.2 kB)

File details

Details for the file aegis_bio-0.3.0.tar.gz.

File metadata

  • Download URL: aegis_bio-0.3.0.tar.gz
  • Upload date:
  • Size: 144.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for aegis_bio-0.3.0.tar.gz
Algorithm Hash digest
SHA256 70d1df22121b68951aa23245280de4d35014efd2c0841f7a67fbc3377687d852
MD5 333af8299f176ba2738d727f4bdda4c1
BLAKE2b-256 3e9d0ad5169191187ee3cb27246cc3ec360c9a016bd91b71de86f78312ecfbed


File details

Details for the file aegis_bio-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: aegis_bio-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 135.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for aegis_bio-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6aa8c37a9539f22276397ccf692a6c9d6ae42d89446d5985f30498e9682b3916
MD5 1a40821d36f38342b8bcd899ee7802f2
BLAKE2b-256 73bbb1994206af7436fbe70d82676c3ccce3bb1558c8beed4cef794f4b7f8d49

