A SUBLYME pipeline for Uncovering Bacteriophage Lysins in Metagenomic Datasets
SUBLYME
About the Project
SUBLYME is a tool to identify bacteriophage lysins. It utilizes the highly informative ProtT5 protein embeddings to make predictions and was trained using proteins in the PHALP database.
Getting started
SUBLYME is published on PyPI for ease of use. The source code is available on GitHub.
Prerequisites
A GPU is recommended to compute embeddings for large datasets.
The full list of dependencies is in requirements.txt; pip installs them automatically.
python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.3.0
transformers==4.43.1
sentencepiece==0.2.0
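To confirm that an environment matches the pinned versions above, a check along the following lines may help. It is a minimal sketch using only the standard library; the `check_pins` helper is hypothetical and not part of SUBLYME. It ignores local build suffixes such as `+cu121`, which some torch builds report.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "joblib": "1.2.0",
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "torch": "2.3.0",
    "scipy": "1.13.1",
    "scikit-learn": "1.3.0",
    "transformers": "4.43.1",
    "sentencepiece": "0.2.0",
}


def check_pins(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    This helper is illustrative only; it is not shipped with SUBLYME.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local build suffix such as "+cu121" before comparing.
        base = installed.split("+")[0] if installed is not None else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Usage: an empty dict means every pinned package matches.
# print(check_pins())
```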
Installation
First create a virtual environment with Python 3.11.5. For example:
conda create -n sublyme_env python=3.11.5
conda activate sublyme_env
From PyPI:
pip install sublyme
Usage
sublyme test/input.faa -t 4
From source:
git clone https://github.com/Rousseau-Team/sublyme.git
cd sublyme
pip install -r requirements.txt
Usage
python3 src/sublyme/sublyme.py test/input.faa -t 4 --models_folder src/sublyme/models
Usage details
A fasta file of protein sequences or a csv file of protein embeddings can be used as input.
Specifying --only_embeddings computes the embeddings only, without lysin prediction. This step is much faster with a GPU. The resulting embeddings file can then be passed back as the input file, using the same command without --only_embeddings.
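The two-step workflow (embeddings first, prediction second) could be scripted roughly as follows. This is a sketch under assumptions: `two_step_commands` is a hypothetical helper, and the name of the embeddings file written by the first step is not documented here, so the caller has to supply it.

```python
def two_step_commands(fasta_path, embeddings_csv, threads=1, output_folder=None):
    """Build the argument lists for the embed-then-predict workflow.

    fasta_path     : protein FASTA used as input to step 1.
    embeddings_csv : embeddings file produced by step 1 (its actual name
                     and location must be checked in the output folder).
    Only flags listed in this README are used.
    """
    common = ["-t", str(threads)]
    if output_folder is not None:
        common += ["-o", output_folder]
    # Step 1: compute embeddings only (ideally on a GPU machine).
    embed_cmd = ["sublyme", fasta_path, *common, "--only_embeddings"]
    # Step 2: same command without --only_embeddings, embeddings as input.
    predict_cmd = ["sublyme", embeddings_csv, *common]
    return embed_cmd, predict_cmd


# Usage (the embeddings path below is a placeholder, not a documented name):
# import subprocess
# embed_cmd, predict_cmd = two_step_commands("test/input.faa", "path/to/embeddings.csv", threads=4)
# subprocess.run(embed_cmd, check=True)
# subprocess.run(predict_cmd, check=True)
```

Splitting the run this way lets the embedding step happen on a GPU node while prediction runs anywhere.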
Options:
- input_file: Path to input file containing protein sequences (.fa*) or protein embeddings (.csv) that you wish to annotate.
- --threads (-t): Number of threads (default 1).
- --output_folder (-o): Path to the output folder. Default folder is ./outputs/.
- --models_folder (-m): Path to folder containing pretrained models (lysin_miner.pkl, val_endo_clf.pkl). Default is src/sublyme/models.
- --only_embeddings: Whether to only calculate embeddings (no lysin prediction).
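For readers wrapping SUBLYME in their own scripts, the options above map onto a command-line interface like the one sketched below. This is an illustration built with `argparse` from the documented names and defaults; it is not the package's actual parser, and treating `--only_embeddings` as a boolean flag is an assumption.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented options (not SUBLYME's own code)."""
    parser = argparse.ArgumentParser(prog="sublyme")
    parser.add_argument(
        "input_file",
        help="Protein sequences (.fa*) or protein embeddings (.csv) to annotate.",
    )
    parser.add_argument("-t", "--threads", type=int, default=1,
                        help="Number of threads.")
    parser.add_argument("-o", "--output_folder", default="./outputs/",
                        help="Path to the output folder.")
    parser.add_argument("-m", "--models_folder", default="src/sublyme/models",
                        help="Folder containing lysin_miner.pkl and val_endo_clf.pkl.")
    # Assumed to be a simple on/off flag.
    parser.add_argument("--only_embeddings", action="store_true",
                        help="Only calculate embeddings (no lysin prediction).")
    return parser


# Usage:
# args = build_parser().parse_args(["test/input.faa", "-t", "4"])
# args.threads, args.output_folder
```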
Output format
The output is a CSV file with one column for the final prediction and one column each for the probabilities associated with lysins, endolysins and VALs.
Example:
| pred | lysin | endolysin | VAL |
|---|---|---|---|
| lysin\|endolysin | 0.98 | 0.95 | 0.05 |
| Na | 0.01 | Na | Na |
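A results file with this layout could be loaded as sketched below. The column names (`pred`, `lysin`, `endolysin`, `VAL`) and the `Na` placeholder are taken from the example table and are assumptions about the real file, which may also contain other columns such as protein identifiers; `read_predictions` is a hypothetical helper.

```python
import csv
import io

PROBA_COLUMNS = ("lysin", "endolysin", "VAL")


def _maybe_float(value):
    """Convert a probability cell to float, mapping empty or 'Na' cells to None."""
    return None if value in ("", "Na") else float(value)


def read_predictions(handle):
    """Parse an open CSV handle into a list of dicts, one per protein.

    Extra columns are kept unchanged; only the prediction and probability
    columns are converted.
    """
    rows = []
    for row in csv.DictReader(handle):
        parsed = dict(row)
        for column in PROBA_COLUMNS:
            parsed[column] = _maybe_float(row[column])
        if parsed["pred"] in ("", "Na"):
            parsed["pred"] = None
        rows.append(parsed)
    return rows


# Usage with the two example rows from the table above.
example = io.StringIO(
    "pred,lysin,endolysin,VAL\n"
    "lysin|endolysin,0.98,0.95,0.05\n"
    "Na,0.01,Na,Na\n"
)
rows = read_predictions(example)
predicted_lysins = [r for r in rows if r["pred"] is not None]
```

For a real run, pass `open(path, newline="")` instead of the in-memory example.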
Note that endolysin/VAL prediction uses a single multiclass classifier, so the two probabilities always sum to one and one of the two labels is always assigned.
This classifier is only applied to proteins first predicted to be lysins (lysin probability > 0.5).
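The two-stage decision described above can be sketched as follows. This is an illustration of the documented logic, not SUBLYME's code: the `final_prediction` function, the `lysin|VAL` label (formed by analogy with `lysin|endolysin`) and the tie-breaking rule are assumptions.

```python
def final_prediction(lysin_proba, endolysin_proba=None, val_proba=None, threshold=0.5):
    """Combine the lysin probability and the endolysin/VAL probabilities into one label.

    Stage 1: a protein is called a lysin only if lysin_proba > threshold.
    Stage 2: for lysins, the subtype with the higher probability is assigned
             (ties go to endolysin here, which is an arbitrary choice).
    """
    if lysin_proba <= threshold:
        # The second classifier is never applied to non-lysins.
        return "Na"
    if endolysin_proba is None or val_proba is None:
        raise ValueError("endolysin and VAL probabilities are required for predicted lysins")
    subtype = "endolysin" if endolysin_proba >= val_proba else "VAL"
    return f"lysin|{subtype}"


# Usage with the example rows from the output table:
# final_prediction(0.98, 0.95, 0.05)  -> "lysin|endolysin"
# final_prediction(0.01)              -> "Na"
```

Because the second stage always picks one of the two subtypes, a high endolysin or VAL probability is only meaningful alongside a high lysin probability.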
File details
Details for the file sublyme-1.1.tar.gz.
File metadata
- Download URL: sublyme-1.1.tar.gz
- Upload date:
- Size: 8.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 34d22deca94b7dda1c21a5a5ea6c4b68032dd168e24ee0376c3faf2c7812b29a |
| MD5 | abf388c3b2845d82a9561265d30e6fbb |
| BLAKE2b-256 | d3e3640833c7b2a3cb626059cbab63c62fa53ddb88b66be3c6270fd94de14592 |
File details
Details for the file sublyme-1.1-py3-none-any.whl.
File metadata
- Download URL: sublyme-1.1-py3-none-any.whl
- Upload date:
- Size: 8.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 316eb9de08e180c4f36b74b1e1d787b4080a057e9b8ee66c83acb29ed499e567 |
| MD5 | 0837acb58bef8c6f3610cd3d68bea9c7 |
| BLAKE2b-256 | 42bd7a986c7b4f0b0a5294dd85c0ba989fa4e9f2de6cc7d50f8f326e571f4910 |
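A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. This is a minimal sketch; `sha256_of` and `verify_download` are hypothetical helpers, and the file is assumed to have been saved under its original name.

```python
import hashlib
import os

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "sublyme-1.1.tar.gz":
        "34d22deca94b7dda1c21a5a5ea6c4b68032dd168e24ee0376c3faf2c7812b29a",
    "sublyme-1.1-py3-none-any.whl":
        "316eb9de08e180c4f36b74b1e1d787b4080a057e9b8ee66c83acb29ed499e567",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest matches the one listed for its name."""
    name = os.path.basename(path)
    if name not in expected:
        raise KeyError(f"no expected digest listed for {name}")
    return sha256_of(path) == expected[name]


# Usage:
# verify_download("sublyme-1.1.tar.gz")
```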