
A SUBLYME pipeline for Uncovering Bacteriophage Lysins in Metagenomic Datasets

SUBLYME

Software for Uncovering Bacteriophage LYsins in MEtagenomic datasets

Table of Contents
  1. About the Project
  2. Getting Started
  3. Usage details
  4. Output format

About the Project

SUBLYME is a tool for identifying bacteriophage lysins. It makes its predictions from ProtT5 protein embeddings and was trained on proteins from the PHALP database.

Getting Started

SUBLYME is available on PyPI for ease of installation. The source code can be downloaded from GitHub.

Prerequisites

A GPU is recommended to compute embeddings for large datasets.
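
Before embedding a large dataset, it can help to confirm that PyTorch actually sees a GPU. The snippet below is a minimal sketch that only reports CUDA visibility through `torch.cuda.is_available()`. The helper name `gpu_available` is hypothetical, and how SUBLYME itself selects a device is not stated here.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # torch is one of the listed dependencies
    except ImportError:
        # Without PyTorch there is nothing to check.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA GPU visible to PyTorch:", gpu_available())
```

If this prints False, embeddings can presumably still be computed on the CPU, only more slowly.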

The full list of dependencies is given in requirements.txt. pip installs them automatically.

python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.3.0
transformers==4.43.1
sentencepiece==0.2.0
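
To compare an existing environment against these pins, something like the following can be used. It is a sketch built on the standard library's `importlib.metadata`; the helper names `check_pins` and `mismatches` are hypothetical and not part of SUBLYME.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "joblib": "1.2.0",
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "torch": "2.3.0",
    "scipy": "1.13.1",
    "scikit-learn": "1.3.0",
    "transformers": "4.43.1",
    "sentencepiece": "0.2.0",
}


def check_pins(pins):
    """Map each package to (expected, installed); installed is None if absent."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


def mismatches(report):
    """Keep only the packages whose installed version differs from the pin."""
    return {name: pair for name, pair in report.items() if pair[0] != pair[1]}


if __name__ == "__main__":
    for name, (expected, installed) in mismatches(check_pins(PINNED)).items():
        print(f"{name}: expected {expected}, found {installed}")
```

No output means every pinned package is installed at the listed version.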

Installation

First create a virtual environment with Python 3.11.5. For example:

conda create -n sublyme_env python=3.11.5
conda activate sublyme_env

From PyPI:

pip install sublyme

Usage

sublyme test/input.faa -t 4

From source:

git clone https://github.com/Rousseau-Team/sublyme.git
cd sublyme
pip install -r requirements.txt

Example:

python3 src/sublyme/sublyme.py test/input.faa -t 4 --models_folder src/sublyme/models

Usage details

A fasta file of protein sequences or a csv file of protein embeddings can be used as input.

Specifying --only_embeddings computes the embeddings and stops before lysin prediction. This step is much faster with a GPU. The resulting embeddings file can then be passed back to the same command (without --only_embeddings) as the input file.

Options:

  • input_file: Path to input file containing protein sequences (.fa*) or protein embeddings (.csv) that you wish to annotate.
  • --threads (-t): Number of threads (default 1).
  • --output_folder (-o): Path to the output folder. Default folder is ./outputs/.
  • --models_folder (-m): Path to folder containing pretrained models (lysin_miner.pkl, val_endo_clf.pkl). Default is src/sublyme/models.
  • --only_embeddings: Compute embeddings only (no lysin prediction).
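
The two-step workflow described above (embeddings first, prediction second) can be scripted. The sketch below only assembles command lines from the options listed here; the helper `build_command` is hypothetical, and so is the embeddings file name, since the name SUBLYME gives that file is not stated in this document.

```python
def build_command(input_file, threads=1, output_folder=None,
                  models_folder=None, only_embeddings=False):
    """Assemble a sublyme command line from the documented options."""
    cmd = ["sublyme", str(input_file), "-t", str(threads)]
    if output_folder is not None:
        cmd += ["-o", str(output_folder)]
    if models_folder is not None:
        cmd += ["-m", str(models_folder)]
    if only_embeddings:
        cmd.append("--only_embeddings")
    return cmd


if __name__ == "__main__":
    # Step 1: compute embeddings only (ideally on a machine with a GPU).
    step1 = build_command("test/input.faa", threads=4, only_embeddings=True)

    # Step 2: feed the embeddings back in for prediction.
    # ASSUMPTION: this file name is a placeholder; check the output folder
    # for the name the tool actually writes.
    embeddings_csv = "outputs/embeddings.csv"
    step2 = build_command(embeddings_csv, threads=4)

    print(" ".join(step1))
    print(" ".join(step2))
    # To execute a step: import subprocess; subprocess.run(step1, check=True)
```

Building the argument list separately from running it makes it easy to inspect the commands first, for example before submitting them to a GPU node.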

Output format

The output is a csv file with one column for the final prediction and one column each for the predicted probabilities of lysin, endolysin and VAL (virion-associated lysin).

Example:

pred             lysin  endolysin  VAL
lysin|endolysin  0.98   0.95       0.05
Na               0.01   Na         Na
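
A file in this layout can be read with the standard library alone. The sketch below assumes the file is comma-separated and uses the column names from the example; a real output file may contain further columns (such as a protein identifier), which this reader ignores. The helper names are hypothetical.

```python
import csv
import io


def parse_value(text):
    """Convert a probability field to float; 'Na' or an empty field becomes None."""
    text = text.strip()
    if text == "" or text.lower() in {"na", "nan"}:
        return None
    return float(text)


def load_predictions(handle):
    """Read prediction rows from an open csv file into a list of dicts."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "pred": record["pred"],
            "lysin": parse_value(record["lysin"]),
            "endolysin": parse_value(record["endolysin"]),
            "VAL": parse_value(record["VAL"]),
        })
    return rows


if __name__ == "__main__":
    # In-memory stand-in for an output file, using the example rows above.
    sample = io.StringIO(
        "pred,lysin,endolysin,VAL\n"
        "lysin|endolysin,0.98,0.95,0.05\n"
        "Na,0.01,Na,Na\n"
    )
    rows = load_predictions(sample)
    lysins = [r for r in rows if r["lysin"] is not None and r["lysin"] > 0.5]
    print(len(rows), "rows,", len(lysins), "predicted lysins")
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.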

Note that endolysins and VALs are handled by a single multiclass classifier, so their two probabilities always sum to one and every protein passed to this classifier is assigned one of the two classes.

Also, the endolysin/VAL classifier is only applied to proteins first predicted to be lysins (lysin probability > 0.5).
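
The two-stage logic described in these notes can be written out as a small function. This is an illustrative sketch, not SUBLYME's own code: the function `combine` is hypothetical, the label `lysin|VAL` is assumed by analogy with `lysin|endolysin` in the example, and choosing the class with the higher probability is an assumption about how the subtype is picked.

```python
LYSIN_THRESHOLD = 0.5  # cutoff stated in the notes above


def combine(lysin_proba, endolysin_proba=None):
    """Return (label, endolysin_proba, VAL_proba) for one protein.

    The subtype is only considered when the first stage predicts a lysin;
    otherwise the label and both subtype probabilities are left empty.
    """
    if lysin_proba <= LYSIN_THRESHOLD:
        return ("Na", None, None)
    if endolysin_proba is None:
        raise ValueError("a subtype probability is needed for predicted lysins")
    # One classifier over two classes: the probabilities sum to one.
    val_proba = 1.0 - endolysin_proba
    # ASSUMPTION: the class with the higher probability wins.
    subtype = "endolysin" if endolysin_proba >= val_proba else "VAL"
    return (f"lysin|{subtype}", endolysin_proba, val_proba)


if __name__ == "__main__":
    print(combine(0.98, 0.95))
    print(combine(0.01))
```

One consequence worth keeping in mind when reading the output: a predicted lysin always receives a subtype, even when neither subtype probability is far from 0.5.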


