An embedding-based phage protein annotation tool by hierarchical assignment

Empathi
Embedding-based Phage Protein Annotation Tool by Hierarchical Assignment

Table of Contents
  1. About the Project
  2. Getting Started
  3. Usage details

About the Project

Empathi predicts the functions of bacteriophage proteins. It makes its predictions from ProtT5 protein embeddings, which are highly informative. In addition, new functional groups were defined that are better suited to machine learning than the often-overlapping PHROG categories.

A preprint is available here.

Getting Started

Empathi is available on PyPI and as an Apptainer container for ease of use.
The source code can also be downloaded from HuggingFace.

Prerequisites

A GPU is recommended for large datasets.

The full list of dependencies and versions can be found in requirements.txt.

Either git-lfs or Apptainer will be required. See instructions below.

Other dependencies are taken care of by pip and Apptainer.

python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.5.0
transformers==4.43.1
sentencepiece==0.2.0
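
The pinned versions above can be compared against an existing environment before running Empathi. The following is a minimal sketch, not part of Empathi: the helper name find_mismatches is hypothetical, and it assumes that a local build suffix such as "+cu121" on torch should be ignored when comparing.

```python
from importlib import metadata

# Versions copied from the dependency list above.
PINNED = {
    "joblib": "1.2.0",
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "torch": "2.3.0",
    "scipy": "1.13.1",
    "scikit-learn": "1.5.0",
    "transformers": "4.43.1",
    "sentencepiece": "0.2.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ from the pin.

    Hypothetical helper for illustration; get_version is injectable for testing.
    """
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Assumption: ignore local build tags such as "2.3.0+cu121".
        base = installed.split("+")[0] if installed else None
        if base != wanted:
            mismatches[package] = installed
    return mismatches


if __name__ == "__main__":
    for package, installed in find_mismatches(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {installed}")
```

An empty result means every pinned package is installed at the listed version.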

Installation

There are three ways to install Empathi: through PyPI, as an Apptainer container, or from source code. Installation should take less than 10 minutes. A small FASTA file is provided to test the installation; this test should run in under 1 minute.

1. PIP

First, create a virtual environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Download the models for Empathi. This requires git-lfs: on WSL or Linux, run sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi
export PATH="/path/to/empathi/models:$PATH"

Install Empathi and its dependencies:

pip install empathi

Usage

empathi input_file name

2. Apptainer

Install Apptainer or Singularity. On Windows, this requires a virtual machine; WSL works well.

Fetch Empathi from Sylabs:

apptainer pull empathi.sif library://alexandreboulay/empathi/empathi

Usage

apptainer run empathi.sif path/to/input_file name --confidence 0.95

3. From source code

First, create a virtual environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Clone the repo. This requires git-lfs: on WSL or Linux, run sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi

Install dependencies:

cd empathi
pip install -r requirements.txt

Usage

python src/empathi/empathi.py input_file name

Usage details

The input can be either a FASTA file of protein sequences or a file of protein embeddings (.pkl or .csv).

By default, a function is assigned only when its confidence exceeds 0.95. A high confidence threshold (--confidence) gives more precise predictions (lower false positive rate) but lower sensitivity (fewer proteins are assigned a function). If the objective of your study is to annotate as many proteins as possible, consider lowering the threshold to as low as 0.5.
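
The effect of the threshold can be illustrated with a small sketch. This is not Empathi's internal logic, which is hierarchical and may apply further rules; it assumes a plain "confidence greater than threshold" rule, and the function name and scores are made up for illustration.

```python
def assign_annotations(confidences, threshold=0.95):
    """Join every category whose confidence exceeds the threshold with '|'.

    Illustrative only: a simple cutoff, not Empathi's actual assignment code.
    """
    kept = [name for name, score in confidences.items() if score > threshold]
    return "|".join(kept)


# Hypothetical per-category confidences for a single protein.
protein = {"PVP": 0.98, "cell wall depolymerase": 0.72, "DNA-associated": 0.005}

strict = assign_annotations(protein, threshold=0.95)
lenient = assign_annotations(protein, threshold=0.5)

print(strict)   # PVP
print(lenient)  # PVP|cell wall depolymerase
```

At 0.95 only the high-confidence category is kept; at 0.5 the protein gains a second annotation, at the cost of a higher chance that it is a false positive.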

With the option --only_embeddings, Empathi computes the embeddings and stops before functional prediction. This step is much faster with a GPU. To obtain predictions afterwards, run the same command without --only_embeddings and give the embeddings file as the input file.

Options:

  • input_file: Path to the input file containing protein sequences (.fa*) or protein embeddings (.pkl/.csv).
  • name: Name of the file to save to (without extension). Use a different name for each run to avoid overwriting files.
  • --models_folder: Path to folder containing EmPATHi models. Can be left unspecified if it was added to PATH earlier.
  • --only_embeddings: Whether to only calculate embeddings (no functional prediction).
  • --output_folder: Path to the output folder. Default is ./empathi_out/.
  • --threads: Number of threads (default 1).
  • --confidence: Confidence threshold used to assign predictions (default 0.95).
  • --mode: Which types of proteins you want to predict. Accepted arguments are "all", "pvp", "DNA-associated", "adsorption-related", "lysis", "regulator", "cell_wall_depolymerase", "packaging", "RNA-associated", "ejection", "phosphorylation", "transferase", "nucleotide_metabolism", "reductase" and "defense_systems".
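
When Empathi is called from a script, the options above can be assembled into a command line programmatically. The sketch below only builds the argument list. The helper name build_empathi_command and the file names are hypothetical, and it assumes --only_embeddings is a switch that takes no value.

```python
import shlex


def build_empathi_command(input_file, name, *, confidence=0.95, threads=1,
                          output_folder=None, mode=None, models_folder=None,
                          only_embeddings=False):
    """Build an argument list for the empathi CLI from the documented options.

    Hypothetical helper for illustration; it does not run anything.
    """
    cmd = ["empathi", str(input_file), name,
           "--confidence", str(confidence),
           "--threads", str(threads)]
    if output_folder is not None:
        cmd += ["--output_folder", str(output_folder)]
    if mode is not None:
        cmd += ["--mode", mode]
    if models_folder is not None:
        cmd += ["--models_folder", str(models_folder)]
    if only_embeddings:
        # Assumption: this option is a flag without an argument.
        cmd.append("--only_embeddings")
    return cmd


# Hypothetical input file and run name.
cmd = build_empathi_command("proteins.faa", "run1",
                            confidence=0.5, threads=4, mode="lysis")
print(shlex.join(cmd))
# empathi proteins.faa run1 --confidence 0.5 --threads 4 --mode lysis
```

The resulting list can be passed to subprocess.run(cmd, check=True). For the two-step workflow, build one command with only_embeddings=True, then a second one whose input_file is the embeddings file written by the first run.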

Output format

The output is a CSV file with an annotation column that lists all annotations assigned to each protein (separated by "|"), plus one column per functional category holding the confidence of that prediction.

Example:

Annotation                   PVP    cell wall depolymerase   DNA-associated   ...
PVP|cell wall depolymerase   0.98   0.99                     0.005            ...
DNA-associated               0.01   0.05                     0.998            ...
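
A file in this layout can be read with Python's standard csv module. The sketch below uses inline data that mirrors the example above. It assumes the annotation column is named "Annotation" and that every other column holds a numeric confidence; a real output file may contain additional columns (for example a protein identifier) that would need separate handling.

```python
import csv
import io

# Hypothetical data mirroring the example above; real files have more columns.
example = io.StringIO(
    "Annotation,PVP,cell wall depolymerase,DNA-associated\n"
    "PVP|cell wall depolymerase,0.98,0.99,0.005\n"
    "DNA-associated,0.01,0.05,0.998\n"
)


def read_predictions(handle, annotation_column="Annotation"):
    """Return a list of (labels, scores) pairs, one per protein.

    labels: annotations split on '|'; scores: {category: confidence}.
    """
    rows = []
    for record in csv.DictReader(handle):
        annotation = record.pop(annotation_column)
        labels = [label for label in annotation.split("|") if label]
        # Assumption: all remaining columns are numeric confidences.
        scores = {category: float(value) for category, value in record.items()}
        rows.append((labels, scores))
    return rows


predictions = read_predictions(example)
for labels, scores in predictions:
    print(labels, max(scores, key=scores.get))
```

To read a real result, replace the StringIO object with an open file handle, for instance open(path, newline="").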

Hierarchical classification
