Functional ANnoTAtion based on embedding space SImilArity

FANTASIA

Introduction

FANTASIA (Functional ANnoTAtion based on embedding space SImilArity) is a pipeline that annotates protein sequences with Gene Ontology (GO) terms using protein language models such as ProtT5, ProstT5, and ESM2. It automates the full workflow, from sequence preprocessing to functional annotation, so that large sets of proteins can be annotated in a scalable way.

Table of Contents

Introduction
Key Features
Prerequisites
Step 1: Clone the Repository
Step 2: Create and Activate a Virtual Environment
Step 3: Start Services
Step 4: Configuration
Step 5: Initialization
Step 6: Run the Pipeline
Documentation
Citation
Contact Information

Key Features

Redundancy Filtering: Removes identical sequences with CD-HIT and optionally excludes sequences based on length constraints.
Embedding Generation: Computes protein sequence embeddings with the ProtT5, ProstT5, and ESM2 language models.
GO Term Lookup: Matches the query embeddings against a vector database (PostgreSQL with pgvector) to retrieve the associated GO terms.
Results: Outputs the transferred annotations together with the corresponding distance matrix.
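
The lookup idea can be illustrated with a small in-memory sketch. It is not FANTASIA's implementation: the real pipeline queries pgvector, and the distance metric, the three-dimensional toy embeddings, and the GO term assignments below are assumptions chosen only for illustration.

```python
import math

# Hypothetical reference set of (embedding, GO terms) pairs. Real embeddings
# have hundreds of dimensions; three are used here to keep the example short.
REFERENCE = [
    ((0.0, 0.0, 1.0), {"GO:0003677"}),
    ((0.0, 1.0, 0.0), {"GO:0008270", "GO:0046872"}),
    ((5.0, 5.0, 5.0), {"GO:0005515"}),
]


def transfer_go_terms(query, reference=REFERENCE, distance_threshold=0.7):
    """Return {GO term: smallest distance} for reference entries within the threshold."""
    hits = {}
    for embedding, go_terms in reference:
        # Euclidean distance is an assumption; the pipeline's metric may differ.
        distance = math.dist(query, embedding)
        if distance <= distance_threshold:
            for term in go_terms:
                hits[term] = min(distance, hits.get(term, math.inf))
    return hits


if __name__ == "__main__":
    # Only the second reference entry lies within 0.7 of this query.
    print(transfer_go_terms((0.0, 0.9, 0.1)))
```

Each transferred term keeps the distance to its closest reference protein, which mirrors the idea of reporting annotations alongside their distances.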

Prerequisites

Operating System: An up-to-date Linux distribution (Ubuntu recommended).
Python: Version 3.10 or higher installed.
Poetry: Installed for dependency management:
```
pip install poetry
```
Docker: Installed and running. If not installed, follow the Docker installation guide.
NVIDIA Driver: Version 550.120 or newer (verify using nvidia-smi).
CUDA: Version 12.4 or newer installed (verify using nvcc --version).
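
The checklist above can be partly automated. The following is a minimal sketch that only checks the Python version and whether each tool is on the PATH; it does not verify the driver or CUDA versions, and the function name check_prerequisites is made up for this example.

```python
import shutil
import sys

# Executables named in the prerequisites list; only their presence is checked.
REQUIRED_TOOLS = ("poetry", "docker", "nvidia-smi", "nvcc")


def check_prerequisites(min_python=(3, 10), tools=REQUIRED_TOOLS):
    """Return a dict mapping each prerequisite name to True or False."""
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:12s} {'ok' if ok else 'missing or too old'}")
```

Version checks for the NVIDIA driver and CUDA still need nvidia-smi and nvcc --version, as described above.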

Step 1: Clone the Repository

git clone https://github.com/CBBIO/FANTASIA.git
cd FANTASIA

Step 2: Create and Activate a Virtual Environment

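Let Poetry create and manage the virtual environment. Note that in Poetry 2.0 and later the shell command is no longer part of Poetry itself and is provided by the separate poetry-plugin-shell plugin.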

poetry install
poetry shell

Step 3: Start Services

To ensure the PostgreSQL and RabbitMQ services are running, use the following commands to start the containers:

Start PostgreSQL with pgvector

Run the following command to start a PostgreSQL container with the pgvector extension:

docker run -d --name pgvectorsql \
    -e POSTGRES_USER=usuario \
    -e POSTGRES_PASSWORD=clave \
    -e POSTGRES_DB=BioData \
    -p 5432:5432 \
    pgvector/pgvector:pg16

Start RabbitMQ

Run the following command to start a RabbitMQ container:

docker run -d --name rabbitmq \
    -p 15672:15672 \
    -p 5672:5672 \
    rabbitmq:management

You can access the RabbitMQ management interface at http://localhost:15672 using the default credentials (guest/guest).
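
To confirm that both containers are accepting connections, a quick TCP check can help. This is a minimal sketch that assumes the ports are published on localhost as in the commands above; an open port only shows that something is listening, not that it is the expected service.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports published by the two docker run commands above.
SERVICES = {
    "PostgreSQL": 5432,
    "RabbitMQ (AMQP)": 5672,
    "RabbitMQ (management UI)": 15672,
}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port}: {state}")
```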

Step 4: Configuration

Before proceeding, create the necessary directories with proper permissions:

mkdir -p ~/fantasia/dumps ~/fantasia/embeddings ~/fantasia/results ~/fantasia/redundancy
chmod -R 755 ~/fantasia

Make sure the following parameters are set correctly in config.yaml:

System Settings

max_workers: 1
constants: "./fantasia/constants.yaml"  # Auxiliary file for the information system, used to add or remove models in this pipeline.

PostgreSQL Configuration

DB_USERNAME: usuario
DB_PASSWORD: clave
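DB_HOST: pgvectorsql  # Container name; it resolves only inside a shared Docker network. Use localhost when running the pipeline on the host with port 5432 published.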
DB_PORT: 5432
DB_NAME: BioData

RabbitMQ Configuration

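rabbitmq_host: rabbitmq  # Container name; it resolves only inside a shared Docker network. Use localhost when running the pipeline on the host with port 5672 published.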
rabbitmq_user: guest
rabbitmq_password: guest
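
For reference, the settings above correspond to connection URLs of the following shape. This is a sketch under assumptions: whether FANTASIA builds URLs this way internally is not stated here, and the helper names postgres_url and amqp_url are invented for the example.

```python
from urllib.parse import quote

# Values copied from the configuration shown above.
CONFIG = {
    "DB_USERNAME": "usuario",
    "DB_PASSWORD": "clave",
    "DB_HOST": "pgvectorsql",
    "DB_PORT": 5432,
    "DB_NAME": "BioData",
    "rabbitmq_host": "rabbitmq",
    "rabbitmq_user": "guest",
    "rabbitmq_password": "guest",
}


def postgres_url(cfg):
    """Build a PostgreSQL connection URL, percent-encoding the credentials."""
    user = quote(str(cfg["DB_USERNAME"]), safe="")
    password = quote(str(cfg["DB_PASSWORD"]), safe="")
    return f"postgresql://{user}:{password}@{cfg['DB_HOST']}:{cfg['DB_PORT']}/{cfg['DB_NAME']}"


def amqp_url(cfg, port=5672):
    """Build an AMQP URL for the default virtual host (no trailing path)."""
    user = quote(str(cfg["rabbitmq_user"]), safe="")
    password = quote(str(cfg["rabbitmq_password"]), safe="")
    return f"amqp://{user}:{password}@{cfg['rabbitmq_host']}:{port}"


if __name__ == "__main__":
    print(postgres_url(CONFIG))
    print(amqp_url(CONFIG))
```

Percent-encoding matters once a password contains characters such as @ or /, which would otherwise break URL parsing.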

Database Dump Source

embeddings_url: "https://zenodo.org/records/14546346/files/embeddings.tar?download=1"

Paths

FANTASIA uses two distinct locations, so check each path you configure:

~/fantasia: Holds input, intermediate, and output files. This directory must exist and have the correct permissions.
./fantasia: The package directory inside the cloned repository, where configuration files and scripts such as config.yaml and main.py reside.

Setting these paths correctly prevents errors caused by missing files or directories.

embeddings_path: ~/fantasia/dumps/
fantasia_output_h5: ~/fantasia/embeddings/
fantasia_output_csv: ~/fantasia/results/
redundancy_file: ~/fantasia/redundancy/output.fasta
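
Before initializing, it can be useful to confirm that every configured directory exists. The sketch below mirrors the path entries above in a plain dictionary instead of parsing config.yaml, and the function name missing_directories is hypothetical.

```python
from pathlib import Path

# Path entries copied from the configuration shown above.
PATHS = {
    "embeddings_path": "~/fantasia/dumps/",
    "fantasia_output_h5": "~/fantasia/embeddings/",
    "fantasia_output_csv": "~/fantasia/results/",
    "redundancy_file": "~/fantasia/redundancy/output.fasta",
}


def missing_directories(paths=PATHS):
    """Return the config keys whose directory does not exist."""
    missing = []
    for key, value in paths.items():
        path = Path(value).expanduser()  # expand ~ to the home directory
        # Entries with a file suffix (e.g. output.fasta) are checked
        # through their parent directory.
        directory = path.parent if path.suffix else path
        if not directory.is_dir():
            missing.append(key)
    return missing


if __name__ == "__main__":
    missing = missing_directories()
    print("All directories exist." if not missing else f"Missing: {missing}")
```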

Step 5: Initialization

Download embeddings and load the database:

python fantasia/main.py initialize --config ./fantasia/config.yaml

Verify that the data has been downloaded and loaded into:
- The folder defined in embeddings_path.
- The configured PostgreSQL database.

Step 6: Run the Pipeline

Before running the pipeline, ensure the necessary input file is placed in the correct location. Copy the zinc_fingers.fasta file from the data_sample directory to the expected input directory:

mkdir -p ~/fantasia/input
cp ./data_sample/zinc_fingers.fasta ~/fantasia/input/zinc_fingers.fasta
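
Before launching a run, it may help to preview how many sequences the input contains and which ones exceed the length limit. The following sketch is a simple standalone FASTA reader, not the pipeline's own filter; the 5000-residue default matches the value passed to --length_filter in the command below.

```python
from pathlib import Path


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines, length_filter=5000):
    """Count records and list the headers of sequences longer than the limit."""
    records = list(read_fasta(lines))
    too_long = [header for header, seq in records if len(seq) > length_filter]
    return {"total": len(records), "too_long": too_long}


if __name__ == "__main__":
    fasta = Path("~/fantasia/input/zinc_fingers.fasta").expanduser()
    if fasta.is_file():
        with fasta.open() as handle:
            print(summarize(handle))
    else:
        print(f"Input file not found: {fasta}")
```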

Run the pipeline on the input FASTA file with the following command:

python fantasia/main.py run \
  --fasta ~/fantasia/input/zinc_fingers.fasta \
  --prefix finger_zinc \
  --length_filter 5000 \
  --redundancy_filter 0.65 \
  --sequence_queue_package 200 \
  --esm \
  --prost \
  --prot \
  --distance_threshold 1:1.2,2:0.7,3:0.7 \
  --batch_size 1:50,2:60,3:40

Explanation of Parameters

--fasta: Specifies the path to the input FASTA file containing protein sequences. In this case: ~/fantasia/input/zinc_fingers.fasta.
--prefix: Sets the prefix for naming the output files. Here, the prefix is finger_zinc.
--length_filter: Filters out sequences longer than 5000 amino acids.
--redundancy_filter: Removes redundant sequences with a similarity threshold of 0.65.
--sequence_queue_package: Defines the number of sequences to be processed per queue package (e.g., 200 sequences).
--esm, --prost, --prot: Enable the corresponding embedding models (ESM2, ProstT5, and ProtT5, respectively).
--distance_threshold: Sets the maximum allowed distances for similarity matching, specific to each model. Here:
- Model 1 (ESM): 1.2
- Model 2 (Prost): 0.7
- Model 3 (Prot): 0.7
--batch_size: Specifies the batch sizes for embedding generation, tailored per model. Here:
- Model 1 (ESM): 50
- Model 2 (Prost): 60
- Model 3 (Prot): 40
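
The per-model values use an id:value format separated by commas. The sketch below shows one way to read such a string; it illustrates the format only and is not FANTASIA's own argument parser. The function name parse_per_model is hypothetical, and the model labels follow the list above.

```python
# Model identifiers as listed in the parameter explanation above.
MODEL_NAMES = {1: "ESM", 2: "Prost", 3: "Prot"}


def parse_per_model(spec, cast=float):
    """Parse a string such as '1:1.2,2:0.7' into {model_id: value}."""
    values = {}
    for item in spec.split(","):
        model_id, sep, raw = item.partition(":")
        if not sep:
            raise ValueError(f"expected 'id:value', got {item!r}")
        values[int(model_id)] = cast(raw)
    return values


if __name__ == "__main__":
    thresholds = parse_per_model("1:1.2,2:0.7,3:0.7")
    batch_sizes = parse_per_model("1:50,2:60,3:40", cast=int)
    for model_id, name in MODEL_NAMES.items():
        print(f"{name}: threshold={thresholds[model_id]}, batch_size={batch_sizes[model_id]}")
```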

Output

Results are written to the directories configured under the following config.yaml keys:

fantasia_output_h5: HDF5 embeddings.
fantasia_output_csv: Processed results.
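
After a run, the generated files can be listed with a few lines of Python. This sketch assumes that output file names start with the value passed to --prefix; the exact naming scheme is not specified here, so treat the prefix match as an assumption. The function name list_outputs is made up for the example.

```python
from pathlib import Path


def list_outputs(directory, prefix):
    """Return the sorted files in `directory` whose names start with `prefix`."""
    root = Path(directory).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.name.startswith(prefix)  # assumed naming scheme
    )


if __name__ == "__main__":
    # Directories taken from the Paths section of the configuration.
    for directory in ("~/fantasia/embeddings/", "~/fantasia/results/"):
        for path in list_outputs(directory, "finger_zinc"):
            print(f"{path} ({path.stat().st_size} bytes)")
```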

Documentation

(Work In Progress)

For complete details on pipeline configuration, parameters, and deployment, visit the FANTASIA Documentation.

Citation

If you use FANTASIA in your work, please cite the following:

Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024). Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
https://doi.org/10.1101/2024.02.28.582465
Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024). Decoding proteome functional information in model organisms using protein language models.
https://doi.org/10.1101/2024.02.14.580341

Contact Information

Francisco Miguel Pérez Canales: fmpercan@upo.es
Gemma I. Martínez-Redondo: gemma.martinez@ibe.upf-csic.es
Ana M. Rojas: a.rojas.m@csic.es
Rosa Fernández: rosa.fernandez@ibe.upf-csic.es



fantasia 0.3.0
