Functional ANnoTAtion based on embedding space SImilArity

FANTASIA

FANTASIA is an advanced pipeline for the automatic functional annotation of protein sequences using state-of-the-art protein language models. It integrates deep learning embeddings and in-memory similarity searches, retrieving reference vectors from a PostgreSQL database with pgvector, to associate Gene Ontology (GO) terms with proteins.

For full documentation, visit FANTASIA Documentation.

⚠️ Important Notice (v3.0.0):

In previous versions of FANTASIA, all input sequences were automatically truncated at 512 amino acids, regardless of model capacity.
This may have negatively affected the accuracy of functional annotation for long proteins by generating incomplete embeddings.

Starting from version 3.0.0, this limitation has been removed. The updated pipeline now processes the full sequence length supported by each model, resulting in more accurate and biologically meaningful representations.
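
To make the effect of the old cutoff concrete, the sketch below computes how much of a sequence survived truncation at 512 residues. The function name `coverage_after_truncation` and the example lengths are illustrative assumptions and are not part of FANTASIA's API.

```python
def coverage_after_truncation(seq_len: int, max_len: int = 512) -> float:
    """Fraction of a sequence that survives truncation to max_len residues."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    return min(seq_len, max_len) / seq_len


# Hypothetical sequence lengths: anything longer than the cutoff loses residues,
# so its embedding would have been computed from an incomplete sequence.
for length in (300, 512, 1200):
    print(length, round(coverage_after_truncation(length), 3))
```

Sequences at or below the cutoff are unaffected, so re-running older annotations matters mostly for long proteins.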

🔄 We strongly recommend updating to FANTASIA v3.0.0. https://zenodo.org/records/16582433
💬 For questions or issues, please contact the CBBIO group.

📌 Current Lookup Table

The lookup table used by FANTASIA, along with its detailed description and specifications, is available in the official Zenodo record:
🔗 https://zenodo.org/records/16582433

We recommend checking this record to:

  • Download the latest lookup table
  • Understand its structure and fields
  • Ensure compatibility with your workflows

Key Features

  • ✅ Available Embedding Models
    Supports protein language models: ProtT5, ProstT5, ESM2 and Ankh for sequence representation.

  • 🔍 Redundancy Filtering
    Uses MMseqs2 to filter homologous sequences out of the lookup table. An adjustable threshold controls the level of redundancy allowed, which supports reliable benchmarking and evaluation.

  • 💾 Optimized Data Storage
    Embeddings of input sequences are stored in HDF5 format. The reference table is hosted in a public relational PostgreSQL database using pgvector.

  • 🚀 Efficient Similarity Lookup
    Performs high-speed searches using in-memory computations. Reference vectors are retrieved from a PostgreSQL database with pgvector for comparison.

  • 🔬 Functional Annotation by Similarity
    Assigns Gene Ontology (GO) terms to proteins based on embedding space similarity, using pre-trained embeddings.
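
As an illustration of the redundancy filtering feature listed above, here is a minimal sketch of threshold-based filtering. It assumes sequence identities between the query and each reference have already been computed (for example, parsed from a sequence-search result). The function name `filter_redundant`, the accessions, and the 0.5 threshold are hypothetical, and the sketch does not reproduce how FANTASIA invokes MMseqs2.

```python
def filter_redundant(hits, identity_to_query, max_identity=0.5):
    """Keep reference hits whose identity to the query is at most max_identity.

    hits: list of reference accessions (strings).
    identity_to_query: dict mapping accession -> sequence identity in [0, 1].
        Accessions missing from the dict are treated as non-homologous (0.0).
    """
    return [acc for acc in hits if identity_to_query.get(acc, 0.0) <= max_identity]


# Toy data: REF_A is nearly identical to the query, so it is dropped.
hits = ["REF_A", "REF_B", "REF_C"]
identity = {"REF_A": 0.95, "REF_B": 0.40}
kept = filter_redundant(hits, identity, max_identity=0.5)
print(kept)
```

Lowering the threshold makes a benchmark stricter, because annotations can then only be transferred from more distantly related references.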

Pipeline Overview (Simplified)

  1. Embedding Generation
    Computes protein embeddings using deep learning models (ProtT5, ProstT5, ESM2 and Ankh).

  2. GO Term Lookup
    Performs vector similarity searches using in-memory computations to assign Gene Ontology terms. Reference embeddings are retrieved from a PostgreSQL database with pgvector. Only experimental evidence codes are used for transfer.
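
The lookup step can be sketched in plain Python as a nearest-neighbour search followed by annotation transfer. This is a simplified illustration, not FANTASIA's implementation: the cosine distance metric, the function names, the toy three-dimensional vectors, and the exact set of evidence codes are all assumptions made for the example.

```python
import math

# GO experimental evidence codes (assumption: the set FANTASIA accepts may differ).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transfer_go_terms(query, references, k=1):
    """Return (go_id, distance) pairs from the k nearest reference embeddings.

    references: list of dicts with keys "embedding" (list of floats) and
    "annotations" (list of (go_id, evidence_code) tuples).
    Only annotations with an experimental evidence code are transferred.
    """
    scored = [(cosine_distance(query, ref["embedding"]), ref) for ref in references]
    scored.sort(key=lambda pair: pair[0])
    results = []
    for distance, ref in scored[:k]:
        for go_id, code in ref["annotations"]:
            if code in EXPERIMENTAL_CODES:
                results.append((go_id, distance))
    return results


# Toy reference set held in memory (in FANTASIA these vectors come from the database).
references = [
    {"embedding": [1.0, 0.0, 0.0],
     "annotations": [("GO:0005524", "IDA"), ("GO:0003677", "IEA")]},
    {"embedding": [0.0, 1.0, 0.0],
     "annotations": [("GO:0003677", "EXP")]},
]
query = [0.9, 0.1, 0.0]
hits = transfer_go_terms(query, references, k=1)
print(hits)
```

In this toy run the nearest reference carries two annotations, but only the one backed by an experimental code (IDA) is transferred; the electronically inferred one (IEA) is skipped.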

📚 Supported Embedding Models

| Name | Model ID | Params | Architecture | Description |
|---|---|---|---|---|
| ESM-2 | facebook/esm2_t33_650M_UR50D | 650M | Encoder (33L) | Learns structure/function from UniRef50. No MSAs. Optimized for accuracy. |
| ProtT5 | Rostlab/prot_t5_xl_uniref50 | 1.2B | Encoder-Decoder | Trained on UniRef50. Strong transfer for structure/function tasks. |
| ProstT5 | Rostlab/ProstT5 | 1.2B | Multi-modal T5 | Learns 3Di structural states + function. Enhances contact/function tasks. |
| Ankh3-Large | ElnaggarLab/ankh3-large | 620M | Encoder (T5-style) | Fast inference. Good semantic/structural representation. |
| ESM3c | esmc_600m | 600M | Encoder (36L) | New gen. model trained on UniRef + MGnify + JGI. High precision & speed. |

Acknowledgments

FANTASIA is the result of a collaborative effort between Ana Rojas’ Lab (CBBIO) (Andalusian Center for Developmental Biology, CSIC) and Rosa Fernández’s Lab (Metazoa Phylogenomics Lab, Institute of Evolutionary Biology, CSIC-UPF). This project demonstrates the synergy between research teams with diverse expertise.

This version of FANTASIA builds upon previous work from:

  • Metazoa Phylogenomics Lab's FANTASIA
    The original implementation of FANTASIA for functional annotation.

  • bio_embeddings
    A state-of-the-art framework for generating protein sequence embeddings.

  • GoPredSim
    A similarity-based approach for Gene Ontology annotation.

  • protein-information-system
    Serves as the reference biological information system, providing a robust data model and curated datasets for protein structural and functional analysis.

We also extend our gratitude to LifeHUB-CSIC for inspiring this initiative and fostering innovation in computational biology.

Citing FANTASIA

If you use FANTASIA in your research, please cite the following publications:

  1. Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024).
    Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
    DOI: 10.1101/2024.02.28.582465

  2. Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024).
    Decoding proteome functional information in model organisms using protein language models.
    DOI: 10.1101/2024.02.14.580341


👥 Project Team

🔧 Technical Team

  • Francisco Miguel Pérez Canales: fmpercan@upo.es
    Author of the system’s engineering and technical implementation
  • Francisco J. Ruiz Mota: fraruimot@alum.us.es
    Junior developer

🧬 Scientific Team & Original Authors of FANTASIA v1

