An embedding-based phage protein annotation tool by hierarchical assignment

Empathi
Embedding-based Phage Protein Annotation Tool by Hierarchical Assignment

Table of Contents
  1. About the Project
  2. Getting Started
  3. Usage details

About the Project

Empathi is a tool for the prediction of bacteriophage protein functions. It utilizes the highly informative ProtT5 protein embeddings to make predictions. In addition, new functional groups were defined to be better suited for machine-learning than the often-overlapping PHROG categories.

A preprint is available here.

Getting Started

Empathi is available on PyPI and as an Apptainer container for ease of use.
The source code can also be downloaded from HuggingFace.

Prerequisites

A GPU is recommended for large datasets.

The full list of dependencies and versions can be found in requirements.txt.

Either git-lfs or Apptainer will be required. See instructions below.

Other dependencies are taken care of by pip and Apptainer.

python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.5.0
transformers==4.43.1
sentencepiece==0.2.0

Installation

There are three ways of installing Empathi: through PyPI, as an Apptainer container or as source code. Installation should take less than 10 minutes. A small fasta file is provided to test installation. This should run in <1 minute.

1. PIP

First, create a virtual environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Download the models for Empathi. You will need git-lfs: on WSL or Linux, use sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi
export PATH="/path/to/empathi/models:$PATH"

Install Empathi and its dependencies:

pip install empathi

Usage

empathi input_file name

2. Apptainer

Install Apptainer or Singularity. On Windows, this requires a virtual machine; WSL works well.

Fetch Empathi from Sylabs:

apptainer pull empathi.sif library://alexandreboulay/empathi/empathi

Usage

apptainer run empathi.sif path/to/input_file name

3. From source code

First, create a virtual environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Clone the repo. You will need git-lfs: on WSL or Linux, use sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi

Install dependencies:

cd empathi
pip install -r requirements.txt

Usage

python src/empathi/empathi.py input_file name

Usage details

A fasta file of protein sequences or a csv file of protein embeddings can be used as input.

Specifying the option --only_embeddings computes the embeddings only, without predicting functions. This step is much faster with a GPU. The resulting embeddings file can then be passed back to the same command (without --only_embeddings) as the input file.

By default, a function is assigned when its confidence is >0.5. Raising the confidence threshold (--confidence) leads to more precise predictions (lower false positive rate), but also to lower sensitivity (fewer proteins are assigned a function).
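The effect of the threshold can be illustrated with a minimal sketch. This is not Empathi's internal code: the `assign` helper and the protein identifiers are hypothetical, and the confidence values are borrowed from the output example in this README. It only shows that a stricter threshold keeps fewer labels.

```python
# Illustrative only: not Empathi's implementation.
# Per-category confidences for two hypothetical proteins.
predictions = {
    "protein_1": {"PVP": 0.98, "cell wall depolymerase": 0.99, "DNA-associated": 0.005},
    "protein_2": {"PVP": 0.01, "cell wall depolymerase": 0.05, "DNA-associated": 0.998},
}


def assign(confidences, threshold=0.5):
    """Join with '|' every category whose confidence exceeds the threshold."""
    kept = [name for name, score in confidences.items() if score > threshold]
    return "|".join(kept)


for threshold in (0.5, 0.985, 0.999):
    annotated = {pid: assign(conf, threshold) for pid, conf in predictions.items()}
    # Proteins left with an empty annotation have no assigned function.
    n_assigned = sum(1 for annotation in annotated.values() if annotation)
    print(threshold, annotated, n_assigned)
```

At 0.5 both proteins receive a label; at 0.985 the first protein keeps only its strongest label; at 0.999 neither protein is assigned a function.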

Options:

  • input_file: Path to the input file containing protein sequences (.fa*) or protein embeddings (.pkl/.csv).
  • name: Name of the output file (without extension). Use a different name for each run to avoid overwriting files.
  • --models_folder: Path to the folder containing the Empathi models. Can be left unspecified if it was added to PATH earlier.
  • --only_embeddings: Whether to only calculate embeddings (no functional prediction).
  • --output_folder: Path to the output folder. Default is ./empathi_out/.
  • --threads: Number of threads (default 1).
  • --confidence: Confidence threshold used to assign predictions (default 50%).
  • --mode: Which types of proteins you want to predict. Accepted arguments are "all", "pvp", "DNA-associated", "adsorption-related", "lysis", "regulator", "cell_wall_depolymerase", "packaging", "RNA-associated", "ejection", "phosphorylation", "transferase", "nucleotide_metabolism", "reductase" and "defense_systems".
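When Empathi is called from a script, the options above can be assembled programmatically. The sketch below is one possible way to do that, under several assumptions: `build_command` is a hypothetical helper, `--only_embeddings` is assumed to be a flag that takes no value, and the file names are placeholders. `--confidence` is left out because the expected value format is not shown here. The commands are only printed, not executed.

```python
import shlex

# Mode values copied from the option list in this README.
MODES = {
    "all", "pvp", "DNA-associated", "adsorption-related", "lysis", "regulator",
    "cell_wall_depolymerase", "packaging", "RNA-associated", "ejection",
    "phosphorylation", "transferase", "nucleotide_metabolism", "reductase",
    "defense_systems",
}


def build_command(input_file, name, *, models_folder=None, only_embeddings=False,
                  output_folder=None, threads=None, mode=None):
    """Build an argument list for the empathi CLI (hypothetical helper)."""
    cmd = ["empathi", input_file, name]
    if models_folder is not None:
        cmd += ["--models_folder", models_folder]
    if only_embeddings:
        # Assumption: the option is a plain switch with no value.
        cmd.append("--only_embeddings")
    if output_folder is not None:
        cmd += ["--output_folder", output_folder]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if mode is not None:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        cmd += ["--mode", mode]
    return cmd


# Step 1: compute embeddings only (placeholder file names).
step1 = build_command("proteins.faa", "run1_embeddings", only_embeddings=True)
# Step 2: reuse the embeddings as input. The path below is a guess;
# check the output folder for the actual file name.
step2 = build_command("empathi_out/run1_embeddings.csv", "run1", mode="lysis", threads=4)

print(shlex.join(step1))
print(shlex.join(step2))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.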

Output format

The output is a csv file with an annotation column listing all annotations assigned to each protein (separated by "|") and one column per functional category giving the confidence associated with each prediction.

Example:

Annotation                  PVP    cell wall depolymerase   DNA-associated   ...
PVP|cell wall depolymerase  0.98   0.99                     0.005            ...
DNA-associated              0.01   0.05                     0.998            ...
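A file in this format can be read with Python's standard csv module. The sketch below assumes a comma-delimited file whose annotation column is named "Annotation" and that contains only the columns shown in the example; a real output file may include further columns (for instance a protein identifier), which would need to be handled separately. The in-memory sample stands in for an actual output file.

```python
import csv
import io

# Stand-in for an Empathi output file (assumed layout, see the note above).
sample = io.StringIO(
    "Annotation,PVP,cell wall depolymerase,DNA-associated\n"
    "PVP|cell wall depolymerase,0.98,0.99,0.005\n"
    "DNA-associated,0.01,0.05,0.998\n"
)


def read_predictions(handle):
    """Return one (labels, scores) pair per protein row."""
    rows = []
    for row in csv.DictReader(handle):
        # Split the "|"-separated annotations into a list of labels.
        labels = [label for label in row.pop("Annotation").split("|") if label]
        # The remaining columns hold one confidence per functional category.
        scores = {category: float(value) for category, value in row.items()}
        rows.append((labels, scores))
    return rows


rows = read_predictions(sample)
for labels, scores in rows:
    best = max(scores, key=scores.get)  # category with the highest confidence
    print(labels, best, scores[best])
```

To read a real file, replace the `io.StringIO` sample with `open(path, newline="")`.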

Hierarchical classification
