Annotation Extraction Genomic Integration Suite.
AEGIS: Annotation Extraction Genomic Integration Suite
AEGIS is a powerful and flexible Python-based suite for the manipulation, analysis, and integration of genomic annotations. It provides a robust, object-oriented framework for working with genomic data, enabling complex analyses and data transformations with intuitive, high-level commands.
Key Features
- Object-Oriented Design: AEGIS represents genomic features (genes, transcripts, exons, etc.) as a hierarchical system of custom Python classes, providing a clean and intuitive API for data manipulation.
- Comprehensive Annotation Handling: Seamlessly parse, process, and export genomic annotations in GFF3 format.
- Extensible and Modular: The modular design of AEGIS allows for easy extension and integration with other bioinformatics tools and pipelines.
- Command Line Interface: Running "aegis --help" in the terminal shows an up-to-date list of the available commands, and help for an individual command is available with "aegis {command} --help". There are 14 commands in total. Key functionalities include tidying up and/or reformatting GFF/FASTA files, sequence extraction, summary annotation statistics, merging of annotations, and comparative genomic analyses such as orthology detection and synteny analysis between annotation files associated with different genomes.
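To make the GFF3 handling mentioned in the features above more concrete, here is a minimal sketch of how a single GFF3 feature line breaks down into its nine tab-separated columns and its key=value attributes. This is not the AEGIS parser; the Gff3Record class and parse_gff3_line function are hypothetical names used only for illustration, and the input line is an illustrative example.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class Gff3Record:
    """Hypothetical container for one GFF3 feature line (not an AEGIS class)."""
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> Gff3Record:
    """Parse one tab-separated GFF3 feature line into a record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    attributes = {}
    # Column 9 holds semicolon-separated key=value pairs, URL-escaped.
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    return Gff3Record(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Illustrative input line
line = "Chr1\tAraport11\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Name=NAC001"
record = parse_gff3_line(line)
print(record.type, record.start, record.end, record.attributes["ID"])
```

A library such as AEGIS builds on this kind of record by linking features through their ID and Parent attributes, which plain line-by-line parsing does not do.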
The AEGIS Class System
The core of AEGIS is its custom class system, which models the hierarchical nature of genomic annotations. This object-oriented approach provides several key advantages over traditional, line-by-line processing of annotation files:
- Intuitive Data Representation: Genomic features are not just lines in a file; they are objects with properties and relationships. A Gene object contains Transcript objects, which in turn contain Exon and CDS objects. This makes the code more readable, maintainable, and less error-prone.
- Data Integrity: The class system enforces data consistency. For example, when a Gene object is updated, all its associated Transcript and sub-feature objects are updated accordingly, ensuring that the annotation remains coherent.
- Complex Queries and Manipulations: The object-oriented structure allows for complex queries and manipulations that would be difficult to perform with traditional text-based tools. For example, you can easily retrieve all coding transcripts for a specific gene, or calculate the total length of all exons in a given transcript.
- Code Reusability: The class-based design promotes code reusability. Once you have defined a class for a specific genomic feature, you can reuse it in different parts of your analysis pipeline.
- Robust Maintenance: Unit tests and continuous integration ensure that the code is reliable and maintainable.
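The advantages listed above can be illustrated with a small, self-contained sketch. The classes below are hypothetical stand-ins written for this example; they are not the AEGIS classes and their attribute and method names are assumptions. They show the two queries mentioned (coding transcripts of a gene, total exon length of a transcript) and how an update to a gene can propagate to its nested features.

```python
from dataclasses import dataclass, field


@dataclass
class Exon:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class Transcript:
    id: str
    exons: list = field(default_factory=list)
    coding: bool = False

    def exon_length(self) -> int:
        """Total length of all exons in this transcript."""
        return sum(exon.length for exon in self.exons)

    def shift(self, offset: int) -> None:
        for exon in self.exons:
            exon.start += offset
            exon.end += offset


@dataclass
class Gene:
    id: str
    start: int
    end: int
    transcripts: list = field(default_factory=list)

    def coding_transcripts(self) -> list:
        return [t for t in self.transcripts if t.coding]

    def shift(self, offset: int) -> None:
        """Move the gene and every nested feature by the same offset."""
        self.start += offset
        self.end += offset
        for transcript in self.transcripts:
            transcript.shift(offset)


gene = Gene("gene1", 100, 500, [
    Transcript("gene1.t1", [Exon(100, 200), Exon(300, 500)], coding=True),
    Transcript("gene1.t2", [Exon(100, 250)], coding=False),
])
print([t.id for t in gene.coding_transcripts()])  # ['gene1.t1']
print(gene.transcripts[0].exon_length())          # 101 + 201 = 302
gene.shift(1000)
print(gene.transcripts[0].exons[0].start)         # 1100
```

Because the shift is applied through the Gene object, the exons cannot end up with coordinates that disagree with their parent, which is the kind of coherence the Data Integrity point describes.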
Core Classes
- Genome: Represents a genome, containing a collection of Scaffold objects.
- Scaffold: Represents a chromosome or scaffold, containing the sequence and a collection of Gene objects.
- Annotation: The main container for genomic annotations, holding a collection of Gene objects.
- Gene: Represents a gene, containing one or more Transcript objects.
- Transcript: Represents a transcript, containing Exon, CDS, and UTR objects.
- Exon, CDS, UTR, Intron: Represent the sub-features of a transcript.
- Protein, Promoter: Represent other biological features of interest.
Installation
You can install and run AEGIS in several ways. Using a container (Docker or Singularity) is the recommended approach as it handles all dependencies automatically.
Using Docker (Recommended)
If you have Docker installed, you can easily pull and run the pre-built AEGIS image from Docker Hub. This image includes AEGIS and all third-party software used for orthology analyses.
1a. Pull the image from Docker Hub:
docker pull tomsbiolab/aegis
1b. Or pull the image from GHCR (GitHub Container Registry):
docker pull ghcr.io/tomsbiolab/aegis
2. Run an AEGIS command:
The following command runs aegis extract on a test dataset. The -v flag is crucial as it makes your current directory accessible inside the container.
docker run --rm -ti -v `pwd`:`pwd` -w `pwd` tomsbiolab/aegis aegis extract -f protein test_data/arabidopsis_araport11.gff3 test_data/arabidopsis_tair10.fasta
3. (Optional) Build the image locally:
If you want to build the image from the source code in this repository, you can use the provided Dockerfile.
docker build -t aegis-local .
You can then run your local image by replacing tomsbiolab/aegis with aegis-local.
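If you call the container from a Python pipeline rather than from the shell, the docker run command from step 2 can be assembled as an argument list. The sketch below is only an illustration: docker_aegis_command is a hypothetical helper, /data/project is a placeholder path, and the interactive -ti flags are left out on the assumption that the command runs from a script without a terminal.

```python
import os
import shlex


def docker_aegis_command(aegis_args, image="tomsbiolab/aegis", workdir=None):
    """Build a `docker run` argument list mirroring the command in step 2."""
    workdir = workdir or os.getcwd()
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:{workdir}",  # mount the directory at the same path
        "-w", workdir,                 # and use it as the working directory
        image, "aegis", *aegis_args,
    ]


cmd = docker_aegis_command(
    ["extract", "-f", "protein",
     "test_data/arabidopsis_araport11.gff3",
     "test_data/arabidopsis_tair10.fasta"],
    workdir="/data/project",  # placeholder path for illustration
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Mounting the directory at the same path inside the container means relative paths such as test_data/... resolve identically on the host and in the container.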
Using Singularity
For high-performance computing (HPC) environments where Docker is not available, Singularity is an excellent alternative.
1. Build the Singularity image from Docker Hub:
singularity build aegis.sif docker://tomsbiolab/aegis
This will create a single aegis.sif file in your current directory.
2. Run an AEGIS command:
Use the singularity run command to execute AEGIS. The -B flag mounts your current directory into the container.
singularity run -B `pwd`:`pwd` aegis.sif aegis extract -f protein test_data/arabidopsis_tair10.gff3 test_data/arabidopsis_tair10.fasta
From PyPI (Python Package Index)
This is the easiest way to install AEGIS. However, some dependencies used by 'aegis orthology' will be missing (such as Liftoff, LiftOn, MCScan, Orthofinder, Diamond...). If you plan to use 'aegis orthology' you will need them; to avoid installing these dependencies yourself, see the Docker and Singularity options above. The latest version on PyPI always matches the latest release.
pip3 install aegis-bio
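After a pip install, a quick way to see which external tools are already available is to look them up on your PATH. The following is a minimal sketch, not part of AEGIS: the executable names are assumptions and may differ on your system, and MCScan is left out because how it is invoked depends on the installation.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
ORTHOLOGY_TOOLS = ["liftoff", "lifton", "orthofinder", "diamond"]


def find_tools(names):
    """Map each executable name to its location on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools(ORTHOLOGY_TOOLS).items():
    print(f"{name}: {path or 'not found'}")
```

Any tool reported as not found would need a manual install, or you can switch to the Docker or Singularity image instead.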
From Source
Alternatively, you can install AEGIS directly from the source by cloning the repository and installing the required Python dependencies.
git clone https://github.com/Tomsbiolab/aegis.git
cd aegis
pip install .
# Or for development (editable mode):
pip install -e .
Usage
AEGIS is designed to be used as a library in your Python scripts or directly through the CLI.
CLI commands
All of the commands are called with aegis {subcommand} in a terminal:
- Native Tools (included in pip install):
  - Extract
    - Extracts all kinds of FASTA features from an annotation.
  - Overlap
    - Quantifies the overlap of gene models (and their subfeatures) between any number of GFFs associated with the same genome.
  - Rename
    - Renames GFF feature IDs.
  - Summary
    - Outputs tabular annotation stats as well as a series of plots.
  - Tidy
    - Cleans annotation files, fixes errors, issues warnings, and provides custom formatting options for extra flexibility/compatibility with third-party tools.
  - Tidy-genome
    - Allows removal and/or renaming of genome features.
  - Merge
    - Custom merge of any number of GFFs, preventing ID clashes and controlling redundancy at the same loci.
  - Symbols
    - Adds gene symbols to an annotation file based on tabular input.
  - Motifs
    - Plots the frequency of a particular DNA motif (regular expressions allowed) in promoter regions of chosen gene lists, all of the genome's genes, and random gene lists.
  - Subset
    - Makes all sorts of subsets of an annotation file: select by desired features (coding/non-coding) or even create lite versions for debugging/testing.
  - Prune
    - Removes features based on ID lists (transcript or gene level) and resolves any derived issues, e.g. removing a gene if all of its transcripts are removed.
  - Reformat
    - Converts between GTF and GFF formats.
  - List
    - Lists gene IDs or transcript IDs from an annotation file, optionally selecting which types of genes/transcripts to include/exclude.
- Integrative Pipelines (require Docker/Singularity or manual install):
  - Orthology
    - Comprehensive multi-tool comparison of gene IDs from different genomes. All evidence is summarised and converted to a qualitative scale, allowing orthologues to be selected by confidence level.
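As an illustration of the rule described for Prune in the list above (a gene is removed once all of its transcripts are removed), here is a minimal sketch on a plain dictionary. It is not the AEGIS implementation; the prune function and the gene-to-transcripts mapping are hypothetical simplifications.

```python
def prune(annotation, transcript_ids):
    """Return a copy of `annotation` without the listed transcripts.

    `annotation` maps a gene id to its list of transcript ids. Genes left
    with no transcripts are dropped as well.
    """
    to_remove = set(transcript_ids)
    pruned = {}
    for gene_id, transcripts in annotation.items():
        kept = [t for t in transcripts if t not in to_remove]
        if kept:  # keep the gene only if at least one transcript survives
            pruned[gene_id] = kept
    return pruned


annotation = {
    "geneA": ["geneA.t1", "geneA.t2"],
    "geneB": ["geneB.t1"],
}
print(prune(annotation, ["geneA.t2", "geneB.t1"]))
# {'geneA': ['geneA.t1']}
```

Here geneB disappears because its only transcript was removed, while geneA survives with one remaining transcript.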
As a Python library
Here is a simple example of how to load an annotation and extract the sequences of all genes:
from aegis.annotation import Annotation
from aegis.genome import Genome
# Load the genome and annotation
genome = Genome(name="my_genome", genome_file_path="path/to/genome.fasta")
annotation = Annotation(name="my_annotation", annot_file_path="path/to/annotation.gff3", genome=genome)
# Generate and export gene sequences
annotation.export.genes()
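Conceptually, exporting a gene sequence means slicing the scaffold sequence with the feature's coordinates and reverse-complementing it when the feature lies on the minus strand. The sketch below shows that idea with GFF3-style 1-based inclusive coordinates; feature_sequence is a hypothetical helper for illustration and does not reflect how AEGIS implements the export.

```python
# Translation table for complementing DNA (N is left unchanged)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def feature_sequence(scaffold_seq: str, start: int, end: int, strand: str) -> str:
    """Return a feature's sequence from 1-based, inclusive coordinates."""
    seq = scaffold_seq[start - 1:end]
    if strand == "-":
        # Minus-strand features are read as the reverse complement
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


scaffold = "AACCGGTTAC"
print(feature_sequence(scaffold, 3, 6, "+"))   # CCGG
print(feature_sequence(scaffold, 7, 10, "-"))  # GTAA
```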
Documentation
For more detailed information on how to use the AEGIS package, including Jupyter Notebook examples, please refer to the GitHub Wiki. The wiki provides comprehensive guides and tutorials to help you get the most out of the suite.
Link to Wiki: https://github.com/Tomsbiolab/aegis/wiki
Citation
If you use AEGIS in your research, please cite the following preprint:
Navarro-Payá, D., Santiago, A., Velt, A., Moretto, M., Rustenholz, C., & Matus, J. T. (2025). AEGIS: an annotation extraction and genomic integration resource. bioRxiv. doi: 10.64898/2025.12.04.692274v1
License
AEGIS is licensed under the GNU General Public License v3.0. See the LICENSE.md file for more details. Third-party tools included in the Docker image are distributed under their respective licenses.