Functional ANnoTAtion based on embedding space SImilArity
FANTASIA
FANTASIA (Functional ANnoTAtion based on embedding space SImilArity) is a pipeline that annotates protein sequences with Gene Ontology (GO) terms using protein language models such as ProtT5, ProstT5, and ESM2. It automates the full workflow, from sequence processing to functional annotation, and provides a scalable way to run functional analysis on large sets of proteins.
Key Features
- Redundancy Filtering: Removes identical sequences with CD-HIT and optionally excludes sequences based on length constraints.
- Embedding Generation: Computes protein sequence embeddings with the supported language models (ProtT5, ProstT5, ESM2).
- GO Term Lookup: Matches embeddings against a vector database to retrieve the associated GO terms.
- Results: Outputs annotations in timestamped CSV files for reproducibility.
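The redundancy filtering step can be illustrated with a minimal sketch. FANTASIA itself delegates this step to CD-HIT; the pure-Python function below is only a stand-in that shows the idea of dropping exact duplicates and applying optional length bounds. The function name `filter_sequences` and its parameters are hypothetical and are not part of the FANTASIA API.

```python
from typing import Dict, Optional


def filter_sequences(
    records: Dict[str, str],
    min_length: int = 0,
    max_length: Optional[int] = None,
) -> Dict[str, str]:
    """Keep the first record of each distinct sequence within the length bounds.

    Hypothetical helper: a simplified stand-in for CD-HIT at 100% identity.
    """
    seen = set()
    kept = {}
    for seq_id, seq in records.items():
        normalized = seq.strip().upper()
        if len(normalized) < min_length:
            continue  # too short
        if max_length is not None and len(normalized) > max_length:
            continue  # too long
        if normalized in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(normalized)
        kept[seq_id] = normalized
    return kept


# Toy input: p2 duplicates p1 (case aside), p3 is below the minimum length.
example = {
    "p1": "MKTAYIAKQR",
    "p2": "mktayiakqr",
    "p3": "MK",
    "p4": "MSTNPKPQRK",
}
filtered = filter_sequences(example, min_length=5)
print(sorted(filtered))
```

Keeping the first occurrence of each sequence makes the result deterministic for a given input order, which helps when comparing runs.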
Installation
To install FANTASIA, make sure Python 3.8+ is installed, then run the following command:
pip install fantasia
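To confirm that the installation succeeded, the installed version can be checked from Python. This is a small sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("fantasia")
print(f"fantasia version: {found}" if found else "fantasia is not installed")
```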
Quick Start
Prerequisites
Ensure the Information System is properly configured before running FANTASIA. Detailed instructions are available in the project documentation.
Running the Pipeline
Execute the following command, specifying the path to the configuration file:
python main.py --config <path_to_config.yaml>
Pipeline Overview
- Redundancy Filtering: Removes identical sequences and optionally filters sequences based on length.
- Embedding Generation: Computes embeddings for sequences using supported models and stores them in HDF5 format.
- GO Term Lookup: Queries a vector database to find and annotate similar proteins.
- Output: Saves annotations in a structured CSV file.
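The lookup and output steps above can be sketched as a nearest-neighbour search followed by a CSV export. This is a toy illustration, not FANTASIA's implementation: it assumes Euclidean distance (the distance metric is not stated here), replaces the vector database with an in-memory list, and uses made-up three-dimensional embeddings. The names `nearest_go_terms` and `write_annotations`, the CSV columns, and the reference protein IDs are all hypothetical.

```python
import csv
import math
import os
import tempfile
from datetime import datetime
from typing import List, Sequence, Tuple

Reference = Tuple[str, Sequence[float], List[str]]  # (protein id, embedding, GO terms)
Hit = Tuple[str, float, List[str]]                  # (protein id, distance, GO terms)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_go_terms(
    query: Sequence[float], references: List[Reference], k: int = 1
) -> List[Hit]:
    """Return the k reference proteins closest to the query embedding."""
    scored = [(pid, euclidean(query, emb), terms) for pid, emb, terms in references]
    scored.sort(key=lambda hit: hit[1])
    return scored[:k]


def write_annotations(
    query_id: str, hits: List[Hit], directory: str, prefix: str = "annotations"
) -> str:
    """Write one row per transferred GO term to a timestamped CSV file."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(directory, f"{prefix}_{stamp}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["query", "reference", "distance", "go_term"])
        for ref_id, distance, terms in hits:
            for term in terms:
                writer.writerow([query_id, ref_id, f"{distance:.4f}", term])
    return path


# Toy reference set: made-up embeddings paired with GO terms.
references = [
    ("ref_kinase", [0.9, 0.1, 0.0], ["GO:0005524"]),  # ATP binding
    ("ref_tf", [0.0, 0.2, 0.9], ["GO:0003677"]),      # DNA binding
]
hits = nearest_go_terms([0.8, 0.2, 0.1], references, k=1)

with tempfile.TemporaryDirectory() as out_dir:
    csv_path = write_annotations("query_1", hits, out_dir)
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
print(rows)
```

Putting the timestamp in the file name keeps each run's output separate, so repeated runs never overwrite earlier annotations.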
Documentation
For complete details on pipeline configuration, parameters, and deployment, visit the FANTASIA Documentation.
Citation
If you use FANTASIA in your work, please cite the following:
- Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024). Illuminating the functional landscape of the dark proteome across the Animal Tree of Life. https://doi.org/10.1101/2024.02.28.582465
- Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024). Decoding proteome functional information in model organisms using protein language models. https://doi.org/10.1101/2024.02.14.580341
Contact Information
- Francisco Miguel Pérez Canales: fmpercan@upo.es
- Gemma I. Martínez-Redondo: gemma.martinez@ibe.upf-csic.es
- Ana M. Rojas: a.rojas.m@csic.es
- Rosa Fernández: rosa.fernandez@ibe.upf-csic.es
File details
Details for the file fantasia-0.2.0.tar.gz.
File metadata
- Download URL: fantasia-0.2.0.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.0 CPython/3.10.16 Linux/6.5.0-1025-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7b0046cf7255d9c8313e9e4c7673cb3db86657eb3baa238bb5b059efa2f39aa0 |
| MD5 | d18d37430354768929ec0f94c64af8f7 |
| BLAKE2b-256 | a5dc5af58faa223a09b0a111698285f1a4e4c051b4f9f0f43de77f4318a7befa |
File details
Details for the file fantasia-0.2.0-py3-none-any.whl.
File metadata
- Download URL: fantasia-0.2.0-py3-none-any.whl
- Upload date:
- Size: 14.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.0 CPython/3.10.16 Linux/6.5.0-1025-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ddfe1331068c435885d4420d14f986d6cfed9bfd558170ae5ee10a82cd163456 |
| MD5 | 7d6effe9c75b19cab8631d714cfabb23 |
| BLAKE2b-256 | 8e7b73006bd587dcfe905f5728b6a5b830059c61d447accbe45ccc6dae6e3791 |