An embedding-based phage protein annotation tool by hierarchical assignment

Empathi
Embedding-based Phage Protein Annotation Tool by Hierarchical Assignment

Table of Contents
  1. About the Project
  2. Getting Started
  3. Usage details

About the Project

Empathi is a tool for predicting bacteriophage protein functions. It makes its predictions from the highly informative ProtT5 protein embeddings. In addition, new functional groups were defined that are better suited to machine learning than the often-overlapping PHROG categories.

A preprint is available here.

Getting Started

Empathi is available on PyPI and as an Apptainer container for ease of use.
The source code can also be downloaded from Hugging Face.

Prerequisites

The full list of dependencies and versions can be found in requirements.txt.

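Either git-lfs (for the PyPI and source-code installations) or Apptainer (for the container) must be installed manually. See the installation instructions below.

All other dependencies are taken care of by pip or by the Apptainer container: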

python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.5.0
transformers==4.43.1
sentencepiece==0.2.0
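
To check whether an existing environment matches these pins, a minimal sketch using only the standard library could look like the following. The `PINS` dictionary and the `check_pins` helper are illustrative names, not part of Empathi; the versions are copied from the list above.

```python
from importlib import metadata

# Pinned versions copied from the list above (illustrative, not shipped with Empathi).
PINS = {
    "joblib": "1.2.0",
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "torch": "2.3.0",
    "scipy": "1.13.1",
    "scikit-learn": "1.5.0",
    "transformers": "4.43.1",
    "sentencepiece": "0.2.0",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            problems[package] = (expected, found)
    return problems


if __name__ == "__main__":
    for package, (expected, found) in check_pins(PINS).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-specific torch wheel) would be reported as a mismatch even if it is otherwise compatible.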

Installation

There are three ways of installing Empathi: through PyPI, as an Apptainer container or as source code.

1. PIP

First, create a virtual environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Download the Empathi models. This requires git-lfs: on WSL or Linux, install it with sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi
export PATH="/path/to/empathi/models:$PATH"

Install Empathi and its dependencies:

pip install empathi

Usage

empathi input_file name

2. Apptainer

Install Apptainer or Singularity. On Windows, this requires a virtual machine; WSL works well.

Fetch Empathi from Sylabs:

apptainer pull empathi.sif library://alexandreboulay/empathi/empathi

Usage

apptainer run empathi.sif path/to/input_file name

3. From source code

First, create a virtual environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Clone the repo. This requires git-lfs: on WSL or Linux, install it with sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi

Install dependencies:

cd empathi
pip install -r requirements.txt

Usage

python src/empathi/empathi.py input_file name

Usage details

The input can be either a FASTA file of protein sequences or a file of protein embeddings (.pkl or .csv).

With the option --only_embeddings, Empathi computes the embeddings and stops there. This step is much faster with a GPU. The resulting embeddings file can then be passed back in with the same command (without --only_embeddings), using the new file as the input file.

Options:

  • input_file: Path to input file containing protein sequences (.fa*) or protein embeddings (.pkl/.csv).
  • name: Name of the file you want to save to (without extension). Should be different between runs to avoid overwriting files.
  • --models_folder: Path to folder containing Empathi models. Can be left unspecified if it was added to PATH earlier.
  • --only_embeddings: Whether to only calculate embeddings (no functional prediction).
  • --output_folder: Path to the output folder. Default is ./empathi_out/.
  • --mode: Which types of proteins you want to predict. Accepted arguments are "all", "pvp", "rbp", "lysin", "regulator"...

When launching Empathi from within Python, omit the '--' in front of the argument names.
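
As an illustration of how the options above combine on the command line, the sketch below assembles an argument list for the empathi command. The helper `build_empathi_command` and the file names are hypothetical; only the flags listed above are used, and --only_embeddings is assumed to be a switch that takes no value.

```python
import shlex


def build_empathi_command(input_file, name, models_folder=None,
                          only_embeddings=False, output_folder=None, mode=None):
    """Assemble the argument list for the `empathi` command-line tool."""
    cmd = ["empathi", str(input_file), str(name)]
    if models_folder is not None:
        cmd += ["--models_folder", str(models_folder)]
    if only_embeddings:
        # Assumed to be a value-less switch, as the description above suggests.
        cmd.append("--only_embeddings")
    if output_folder is not None:
        cmd += ["--output_folder", str(output_folder)]
    if mode is not None:
        cmd += ["--mode", mode]
    return cmd


if __name__ == "__main__":
    # Hypothetical input file and run name, for illustration only.
    cmd = build_empathi_command("proteins.faa", "run1", mode="pvp")
    print(shlex.join(cmd))
    # To actually launch Empathi (requires `pip install empathi`):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.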

Output format

The output is a csv file with an annotation column grouping all annotations assigned to each protein (separated by "|"), plus one column per functional category giving the confidence of that prediction. Functions are assigned when the confidence is >0.5, but a user can apply a stricter threshold by filtering the output table.

Ex.

Annotation                   PVP    cell wall depolymerase   DNA-associated   ...
PVP|cell wall depolymerase   0.98   0.99                     0.005            ...
DNA-associated               0.01   0.05                     0.998            ...
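
A stricter threshold can be applied after the fact by recomputing the annotation column from the confidence columns. The sketch below does this with the standard csv module on a toy table that mirrors the example above. The helper `reannotate` is hypothetical, and it assumes that every column other than the annotation column holds a confidence value, which may not hold for the real output file (for example if it also contains a protein identifier column). It also applies the threshold to each category independently, ignoring any hierarchical logic Empathi may use.

```python
import csv
import io


def reannotate(rows, threshold, annotation_col="Annotation"):
    """Return copies of `rows` whose annotation keeps only categories above `threshold`."""
    updated = []
    for row in rows:
        kept = [
            category
            for category, value in row.items()
            if category != annotation_col and float(value) > threshold
        ]
        updated.append({**row, annotation_col: "|".join(kept)})
    return updated


# Toy csv mirroring the example above; a real run would read Empathi's output file instead.
TOY_CSV = """Annotation,PVP,cell wall depolymerase,DNA-associated
PVP|cell wall depolymerase,0.98,0.99,0.005
DNA-associated,0.01,0.05,0.998
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
    for row in reannotate(rows, threshold=0.985):
        print(row["Annotation"])
```

With a threshold of 0.985, the first protein keeps only "cell wall depolymerase" because its PVP confidence (0.98) no longer passes.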
