Functional ANnoTAtion based on embedding space SImilArity
FANTASIA v4.1
FANTASIA is an advanced pipeline for the automatic functional annotation of protein sequences using state-of-the-art protein language models. It integrates deep learning embeddings and in-memory similarity searches, retrieving reference vectors from a PostgreSQL database with pgvector, to associate Gene Ontology (GO) terms with proteins.
For full documentation, visit FANTASIA Documentation.
For users who need a lightweight, standalone alternative, FANTASIA-Lite provides fast Gene Ontology annotation directly from local FASTA files, without requiring a database server or the full FANTASIA infrastructure. It leverages protein language model embeddings and nearest-neighbor similarity in embedding space to deliver high-quality functional annotations with minimal setup.
For FANTASIA-Lite, visit https://github.com/CBBIO/FANTASIA-Lite
Reference Datasets
Two packaged reference datasets are available; select one depending on your analysis needs:
- Main Reference (last layer, default)
  Embeddings extracted only from the final hidden layer of each protein language model (PLM).
  Recommended for most annotation tasks (smaller, faster to load).
  Record: https://zenodo.org/records/17795871
- Multilayer Reference (early layers + final layers)
  Embeddings extracted from multiple hidden layers (including intermediate and final).
  Suitable for comparative and exploratory analyses requiring layer-wise representations.
  Record: https://zenodo.org/records/17793273
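The difference between the two references can be illustrated with a small sketch. It assumes a model exposes one matrix of per-residue vectors for each hidden layer and that a per-protein embedding is obtained by mean-pooling over residues; both the data layout and the pooling step are assumptions made for illustration, and `mean_pool` and `select_layers` are hypothetical helpers, not part of FANTASIA.

```python
from statistics import fmean


def mean_pool(layer):
    """Average the per-residue vectors (rows) into one per-protein vector."""
    return [fmean(column) for column in zip(*layer)]


def select_layers(hidden_states, layer_indices=(-1,)):
    """Return {layer_index: pooled_vector} for the requested layers.

    Negative indices count from the end, so the default (-1,) keeps only
    the final hidden layer, mirroring the "last layer" reference.
    """
    n_layers = len(hidden_states)
    pooled = {}
    for index in layer_indices:
        resolved = index if index >= 0 else n_layers + index
        if not 0 <= resolved < n_layers:
            raise IndexError(f"layer {index} out of range for {n_layers} layers")
        pooled[resolved] = mean_pool(hidden_states[resolved])
    return pooled


# Toy data (hypothetical): 3 layers, 2 residues, 2 dimensions.
hidden_states = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[1.0, 1.0], [3.0, 5.0]],
    [[4.0, 0.0], [6.0, 2.0]],
]

last_layer_only = select_layers(hidden_states)       # final layer only
multilayer = select_layers(hidden_states, (0, -1))   # early + final layer
```

The multilayer variant stores one vector per selected layer, which is why that reference is larger and slower to load.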
Key Features
- ✅ Available Embedding Models
  Supports the protein language models ESM-2, ProtT5, ProstT5, Ankh3-Large, and ESM3c for sequence representation.
- 🔍 Redundancy Filtering
  Uses MMseqs2 to filter homologous sequences out of the lookup table. An adjustable threshold controls the redundancy level, which supports reliable benchmarking and evaluation.
- 💾 Optimized Data Storage
  Embeddings of input sequences are stored in HDF5 format. The reference table is hosted in a public relational PostgreSQL database using pgvector.
- 🚀 Efficient Similarity Lookup
  High-throughput similarity search with a hybrid approach: reference embeddings are stored in a PostgreSQL + pgvector database and fetched into memory in batches, where similarities are computed.
- 🧭 Global & Local Alignment of Hits
  Candidate hits from the reference table are aligned both globally and locally against the input protein for validation and scoring.
- 🧩 Multi-layer Embedding Support
  Optional support for intermediate + final layers to enable layer-wise analyses and improved exploration.
- 📦 Raw Outputs & Flexible Post-processing
  Exposes raw result tables for custom analyses and includes a flexible post-processing & scoring system that produces TopGO-ready files.
- 🔬 Functional Annotation by Similarity
  Assigns Gene Ontology (GO) terms to proteins based on embedding space similarity, using pre-trained embeddings from all supported models.
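The hybrid lookup described under Efficient Similarity Lookup can be sketched as follows. This is a minimal illustration, not FANTASIA's implementation: the reference rows come from an in-memory list standing in for database batches, cosine similarity is assumed as the metric, and `cosine_similarity`, `batched`, and `top_k_hits` are hypothetical names.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def batched(rows, size):
    """Yield lists of at most `size` rows, like paging through a query result."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def top_k_hits(query, reference_rows, k=2, batch_size=2):
    """Scan reference rows batch by batch and keep the k most similar ones."""
    best = []  # min-heap of (similarity, reference_id)
    for batch in batched(reference_rows, batch_size):
        for ref_id, vector in batch:
            score = cosine_similarity(query, vector)
            if len(best) < k:
                heapq.heappush(best, (score, ref_id))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, ref_id))
    return sorted(best, reverse=True)


# Toy reference table (hypothetical IDs and vectors).
reference = [
    ("P1", [1.0, 0.0]),
    ("P2", [0.0, 1.0]),
    ("P3", [1.0, 1.0]),
    ("P4", [-1.0, 0.0]),
]
hits = top_k_hits([1.0, 0.1], reference, k=2)
```

Only one batch and the current top `k` hits are held in memory at a time, so the scan works the same way however large the reference table is.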
Pipeline Overview (Simplified)
- Embedding Generation
  Computes protein embeddings using the supported protein language models (ESM-2, ProtT5, ProstT5, Ankh3-Large, and ESM3c).
- GO Term Lookup
  Performs vector similarity searches using in-memory computations to assign Gene Ontology terms. Reference embeddings are retrieved from a PostgreSQL database with pgvector. Only annotations backed by experimental evidence codes are transferred.
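The transfer step can be sketched as below. The exact set of evidence codes FANTASIA accepts is not stated here, so the sketch assumes the six core experimental codes defined by the GO Consortium; `transfer_go_terms` and the toy annotations are hypothetical.

```python
# Assumption: the core GO experimental evidence codes.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


def transfer_go_terms(hits, annotations, allowed_codes=EXPERIMENTAL_CODES):
    """Map each transferable GO term to the best similarity that supports it.

    hits:        list of (reference_id, similarity) pairs
    annotations: dict reference_id -> list of (go_id, evidence_code)
    """
    transferred = {}
    for ref_id, similarity in hits:
        for go_id, code in annotations.get(ref_id, []):
            if code not in allowed_codes:
                continue  # skip non-experimental annotations such as IEA
            if similarity > transferred.get(go_id, float("-inf")):
                transferred[go_id] = similarity
    return transferred


# Toy inputs (hypothetical reference proteins).
hits = [("P1", 0.95), ("P3", 0.77)]
annotations = {
    "P1": [("GO:0003677", "IDA"), ("GO:0005634", "IEA")],
    "P3": [("GO:0003677", "IMP"), ("GO:0006355", "EXP")],
}
go_terms = transfer_go_terms(hits, annotations)
```

In this toy case `GO:0005634` is dropped because its only support is an electronically inferred (IEA) annotation.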
Setting Up Required Services with Docker Compose
FANTASIA requires two key services:
- PostgreSQL 16 with pgvector: Stores reference protein embeddings and provides vector similarity search
- RabbitMQ: Message broker for distributed embedding task processing
Prerequisites
- Docker and Docker Compose installed
Quick Start
- Start services (from the FANTASIA directory):
  docker-compose up -d
- Verify services are running:
  docker-compose ps
  Expected output:
  CONTAINER ID   IMAGE                          STATUS
  xxx            pgvector/pgvector:0.7.0-pg16   Up (healthy)
  xxx            rabbitmq:3.13-management       Up (healthy)
- Test database connection:
  PGPASSWORD=clave psql -h localhost -U usuario -d BioData -c "SELECT 1"
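When `psql` is not installed on the host, a quick reachability check from Python can confirm that both service ports accept TCP connections. This is only a sketch: it uses the default hosts and ports listed in this README, `port_is_open` and `check_services` are hypothetical helpers, and an open port does not prove that authentication will succeed.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default endpoints from this README; adjust if your setup differs.
SERVICES = {
    "PostgreSQL": ("localhost", 5432),
    "RabbitMQ": ("localhost", 5672),
}


def check_services(services=SERVICES):
    """Return {service_name: reachable} for every configured endpoint."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in services.items()
    }
```

Calling `check_services()` returns a dictionary such as `{"PostgreSQL": True, "RabbitMQ": False}`, which shows which container to inspect first.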
Service Credentials
The docker-compose.yml is configured with the following default credentials (matching config.yaml):
| Service | Host | Port | User | Password | Database |
|---|---|---|---|---|---|
| PostgreSQL | localhost | 5432 | usuario | clave | BioData |
| RabbitMQ | localhost | 5672 | guest | guest | - |
RabbitMQ Management UI is available at: http://localhost:15672 (user: guest, password: guest)
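Client libraries commonly accept these settings as connection URLs. The sketch below assembles such URLs from the default credentials; whether FANTASIA itself builds URLs this way is not stated here, and `postgres_url` and `amqp_url` are hypothetical helpers.

```python
from urllib.parse import quote


def postgres_url(user, password, host, port, database):
    """Build a PostgreSQL connection URL, percent-encoding special characters."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


def amqp_url(user, password, host, port, vhost="/"):
    """Build an AMQP URL; the default vhost "/" is encoded as %2F."""
    return (
        f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(vhost, safe='')}"
    )


db_url = postgres_url("usuario", "clave", "localhost", 5432, "BioData")
broker_url = amqp_url("guest", "guest", "localhost", 5672)
```

Percent-encoding matters once the default password is replaced by one containing characters such as `@` or `/`.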
Troubleshooting
Connection refused error:
# Check if containers are running
docker-compose ps
# If stopped, restart them
docker-compose restart
# View logs
docker-compose logs postgres
docker-compose logs rabbitmq
Password authentication failed:
Ensure the credentials in docker-compose.yml match those in config.yaml:
# Current values in docker-compose.yml
POSTGRES_USER: usuario
POSTGRES_PASSWORD: clave
POSTGRES_DB: BioData
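A mismatch between the two files can be spotted by comparing the relevant values side by side. The sketch below works on plain dictionaries; the `config.yaml` key names (`DB_USERNAME`, `DB_PASSWORD`, `DB_NAME`) are assumptions made for illustration and should be replaced with the keys your configuration actually uses.

```python
# Hypothetical mapping from docker-compose variables to config.yaml keys.
KEY_MAP = {
    "POSTGRES_USER": "DB_USERNAME",
    "POSTGRES_PASSWORD": "DB_PASSWORD",
    "POSTGRES_DB": "DB_NAME",
}


def credential_mismatches(compose_env, app_config, key_map=KEY_MAP):
    """Return {compose_key: (compose_value, config_value)} for differing pairs."""
    diffs = {}
    for compose_key, config_key in key_map.items():
        compose_value = compose_env.get(compose_key)
        config_value = app_config.get(config_key)
        if compose_value != config_value:
            diffs[compose_key] = (compose_value, config_value)
    return diffs


compose_env = {
    "POSTGRES_USER": "usuario",
    "POSTGRES_PASSWORD": "clave",
    "POSTGRES_DB": "BioData",
}
# Toy config with a deliberately wrong password.
app_config = {"DB_USERNAME": "usuario", "DB_PASSWORD": "secret", "DB_NAME": "BioData"}
diffs = credential_mismatches(compose_env, app_config)
```

An empty result means the two sets of credentials agree.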
Cleaning up: To remove containers and volumes:
docker-compose down -v
📚 Supported Embedding Models
| Name | Model ID | Params | Architecture | Description |
|---|---|---|---|---|
| ESM-2 | facebook/esm2_t33_650M_UR50D | 650M | Encoder (33L) | Learns structure/function from UniRef50. No MSAs. Optimized for accuracy. |
| ProtT5 | Rostlab/prot_t5_xl_uniref50 | 1.2B | Encoder-Decoder | Trained on UniRef50. Strong transfer for structure/function tasks. |
| ProstT5 | Rostlab/ProstT5 | 1.2B | Multi-modal T5 | Learns 3Di structural states + function. Enhances contact/function tasks. |
| Ankh3-Large | ElnaggarLab/ankh3-large | 620M | Encoder (T5-style) | Fast inference. Good semantic/structural representation. |
| ESM3c | esmc_600m | 600M | Encoder (36L) | New-generation model trained on UniRef + MGnify + JGI. High precision & speed. |
Acknowledgments
FANTASIA is the result of a collaborative effort between Ana Rojas’ Lab (CBBIO) (Andalusian Center for Developmental Biology, CSIC) and Rosa Fernández’s Lab (Metazoa Phylogenomics Lab, Institute of Evolutionary Biology, CSIC-UPF). This project demonstrates the synergy between research teams with diverse expertise.
This version of FANTASIA builds upon previous work from:
- Metazoa Phylogenomics Lab's FANTASIA
  The original implementation of FANTASIA for functional annotation.
- bio_embeddings
  A state-of-the-art framework for generating protein sequence embeddings.
- GoPredSim
  A similarity-based approach for Gene Ontology annotation.
- protein-information-system
  Serves as the reference biological information system, providing a robust data model and curated datasets for protein structural and functional analysis.
We also extend our gratitude to LifeHUB-CSIC for inspiring this initiative and fostering innovation in computational biology.
Citing FANTASIA
If you use FANTASIA in your research, please cite the following publications:
- Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024).
  Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
  DOI: 10.1101/2024.02.28.582465
- Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024).
  Decoding proteome functional information in model organisms using protein language models.
  DOI: 10.1101/2024.02.14.580341
License
FANTASIA is distributed under the terms of the GNU Affero General Public License v3.0.
👥 Project Team
- Ana M. Rojas: a.rojas.m@csic.es
- Rosa Fernández: rosa.fernandez@ibe.upf-csic.es
- Gemma I. Martínez-Redondo: gemma.martinez@ibe.upf-csic.es
- Francisco Miguel Pérez Canales: fmpercan@upo.es
- Belén Carbonetto: belen.carbonetto.metazomics@gmail.com
- Francisco J. Ruiz Mota: fraruimot@alum.us.es
- Àlex Domínguez Rodríguez: adomrod4@upo.es