A CLI tool for predicting lncRNA–Protein interactions using transformer embeddings and CatBoost

🧬 lncrna-PI - LncRNA–Protein Interaction Prediction

lncrnapi is a command-line tool for predicting lncRNA–protein interactions. It embeds sequences with pre-trained language models (DNABERT-2 for lncRNAs, ESM-2 for proteins) and estimates the interaction probability with a CatBoost classifier.


🚀 Overview

The tool scores lncRNA–protein pairs in bulk from FASTA input.
It embeds each sequence with a pre-trained transformer model and passes the embeddings to a pre-trained CatBoost model, which returns an interaction probability for each pair.


📦 Features

  • Supports FASTA input for lncRNA and protein sequences.
  • Generates embeddings using:
    • 🧬 DNABERT-2 (zhihan1996/DNABERT-2-117M) for lncRNAs
    • 🧫 ESM-2 (facebook/esm2_t30_150M_UR50D) for proteins
  • Predicts interaction probabilities using a CatBoost classifier.
  • Supports GPU acceleration (CUDA / MPS) for faster inference.
  • Outputs results in CSV format.
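
For orientation, the two checkpoints named above can be loaded through the Hugging Face `transformers` auto classes. This is a minimal sketch, not the tool's own loading code: the helper names `load_encoder` and `load_encoders` are hypothetical, and passing `trust_remote_code=True` for DNABERT-2 follows that checkpoint's model card rather than anything stated in this README.

```python
# Checkpoint names are taken from the feature list above.
LNCRNA_CHECKPOINT = "zhihan1996/DNABERT-2-117M"
PROTEIN_CHECKPOINT = "facebook/esm2_t30_150M_UR50D"


def load_encoder(checkpoint, trust_remote_code=False):
    """Return (tokenizer, model) for a Hugging Face checkpoint (hypothetical helper).

    transformers is imported lazily so that importing this module neither
    requires the library nor triggers a download.
    """
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        checkpoint, trust_remote_code=trust_remote_code
    )
    model = AutoModel.from_pretrained(
        checkpoint, trust_remote_code=trust_remote_code
    )
    model.eval()  # inference only: disables dropout
    return tokenizer, model


def load_encoders():
    """Load both encoders and return them keyed by molecule type (hypothetical helper)."""
    # Assumption: DNABERT-2 ships custom modelling code and needs
    # trust_remote_code=True; ESM-2 is a built-in transformers architecture.
    lncrna = load_encoder(LNCRNA_CHECKPOINT, trust_remote_code=True)
    protein = load_encoder(PROTEIN_CHECKPOINT)
    return {"lncrna": lncrna, "protein": protein}
```

Calling `load_encoders()` downloads both checkpoints on first use, so it needs network access and enough disk space for the model weights.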

🧰 Installation

Install the package from PyPI with pip:

pip install lncrnapi

⚙️ Usage

After installation, run the lncrnapi command:

lncrnapi \
    --lncrna_fasta /path/to/lncrnas.fasta \
    --protein_fasta /path/to/proteins.fasta \
    --model_path /path/to/saved_model.joblib \
    --output_file /path/to/results.csv

Arguments

| Argument | Description | Required |
|---|---|---|
| --lncrna_fasta | Path to the FASTA file containing lncRNA sequences. | |
| --protein_fasta | Path to the FASTA file containing protein sequences. | |
| --model_path | Path to the pre-trained CatBoost model file (.cbm, .joblib, or .pkl). | |
| --output_file | Path to save the CSV file with predicted probabilities. | |
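
To call the tool from a Python workflow, the four documented flags can be assembled into an argument list and passed to `subprocess`. The sketch below is illustrative only: `build_command` and `run_prediction` are hypothetical helper names, and it assumes that `lncrnapi` is on the `PATH` and exits with a non-zero status on failure.

```python
import subprocess


def build_command(lncrna_fasta, protein_fasta, model_path, output_file):
    """Return the lncrnapi argument list for one prediction run."""
    return [
        "lncrnapi",
        "--lncrna_fasta", str(lncrna_fasta),
        "--protein_fasta", str(protein_fasta),
        "--model_path", str(model_path),
        "--output_file", str(output_file),
    ]


def run_prediction(lncrna_fasta, protein_fasta, model_path, output_file):
    """Run lncrnapi; raises CalledProcessError if the exit status is non-zero."""
    cmd = build_command(lncrna_fasta, protein_fasta, model_path, output_file)
    # Assumption: the CLI signals failure through its exit status.
    subprocess.run(cmd, check=True)
    return output_file
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.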

🧠 How It Works

  1. Model Loading
    The tool loads the DNABERT-2 and ESM-2 models from Hugging Face.

  2. FASTA Parsing
    Extracts sequence IDs and corresponding sequences from input FASTA files.

  3. Embedding Generation
    Computes mean pooled embeddings for each sequence using transformer hidden states.

  4. Prediction
    Concatenates embeddings (lncRNA + protein) and predicts the interaction probability using the CatBoost model.

  5. Output
    Generates a .csv file containing:

    • LncRNA_ID
    • Protein_ID
    • Interaction_Probability
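
Steps 2 to 4 above can be illustrated without loading any model. The following is a minimal sketch in pure Python, with nested lists standing in for transformer hidden states. The function names are hypothetical, and several details are assumptions rather than the tool's documented behavior: that the sequence ID is the first token of the header, that padding positions are excluded from the mean, and that the lncRNA embedding comes first in the concatenation.

```python
def parse_fasta(text):
    """Return a list of (sequence_id, sequence) tuples from FASTA text."""
    records, seq_id, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            # Assumption: the ID is the first whitespace-delimited header token.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)  # sequences may span several lines
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def pair_features(lncrna_embedding, protein_embedding):
    """Concatenate the two embeddings, lncRNA first, into one feature row."""
    return list(lncrna_embedding) + list(protein_embedding)
```

Each feature row has a length equal to the sum of the two embedding dimensions, and the classifier receives one such row per lncRNA–protein pair.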

📊 Example Output

| LncRNA_ID | Protein_ID | Interaction_Probability |
|---|---|---|
| lnc001 | P12345 | 0.9421 |
| lnc002 | Q8N6T7 | 0.3175 |

⚡ Hardware Acceleration

The script automatically detects and uses available hardware:

  • CUDA GPU (NVIDIA)
  • MPS (Apple Silicon)
  • ⚠️ CPU (fallback)
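
Device selection of this kind is commonly written with PyTorch's availability checks. The sketch below is one way to do it, not the tool's actual code: the function names are hypothetical, and the priority order (CUDA, then MPS, then CPU) is an assumption read off the order of the list above.

```python
def pick_device(cuda_available, mps_available):
    """Return a device name using the assumed priority CUDA > MPS > CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    mps_available = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)
```

Keeping the priority logic in `pick_device` separate from the PyTorch queries makes it testable on machines without a GPU.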

🧩 Model Formats Supported

| Format | Description |
|---|---|
| .cbm | Native CatBoost model format |
| .joblib | Joblib-serialized model |
| .pkl | Pickle-based serialized model |
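
Supporting three formats implies choosing a loader based on the file extension. The sketch below shows one way such a dispatch could look. It is an assumption about the approach, not the tool's implementation, and `load_model` is a hypothetical name.

```python
import pickle
from pathlib import Path


def load_model(model_path):
    """Load a classifier, choosing the loader from the file extension."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    suffix = path.suffix.lower()
    if suffix == ".cbm":
        from catboost import CatBoostClassifier  # imported only when needed

        model = CatBoostClassifier()
        model.load_model(str(path))
        return model
    if suffix == ".joblib":
        import joblib  # imported only when needed

        return joblib.load(path)
    if suffix == ".pkl":
        # Only unpickle files from a trusted source: pickle can execute code.
        with path.open("rb") as handle:
            return pickle.load(handle)
    raise ValueError(f"Unsupported model format: {suffix!r}")
```

With a loaded `CatBoostClassifier`, `model.predict_proba(features)[:, 1]` returns the positive-class probability for each feature row. Whether the tool reports exactly this column is an assumption.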

🛠 Troubleshooting

| Issue | Possible Cause | Solution |
|---|---|---|
| Model file not found | Wrong --model_path | Check the file path |
| No sequences found in FASTA | Invalid FASTA format | Ensure > headers are present |
| safetensors error | Missing library | Install with pip install safetensors |
| Slow performance | CPU usage | Use GPU-enabled environment |
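
The first two issues in the table can be caught before a long run starts. The following is a small pre-flight sketch. The helper names and message strings are hypothetical and do not reproduce the tool's own error messages.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count lines starting with '>' (one per FASTA record)."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def preflight(lncrna_fasta, protein_fasta, model_path):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if not Path(model_path).is_file():
        problems.append(f"Model file not found: {model_path}")
    for label, fasta in (("lncRNA", lncrna_fasta), ("protein", protein_fasta)):
        if not Path(fasta).is_file():
            problems.append(f"{label} FASTA not found: {fasta}")
        elif count_fasta_records(fasta) == 0:
            problems.append(f"No '>' headers in {label} FASTA: {fasta}")
    return problems
```

Running `preflight(...)` and printing any returned messages before invoking the tool avoids loading the transformer models only to fail on a bad path.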

📁 Output Example

$ head results.csv
LncRNA_ID,Protein_ID,Interaction_Probability
lnc001,P12345,0.9421
lnc002,Q8N6T7,0.3175
lnc003,O76074,0.7814
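
Because the output is a plain CSV file with the three columns shown, it can be post-processed with the standard library alone. The sketch below parses the rows and keeps the pairs above a cutoff. The function names are hypothetical and the 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
import csv
import io


def load_predictions(handle):
    """Parse the results CSV into (lncrna_id, protein_id, probability) tuples."""
    reader = csv.DictReader(handle)
    return [
        (row["LncRNA_ID"], row["Protein_ID"], float(row["Interaction_Probability"]))
        for row in reader
    ]


def filter_by_probability(predictions, threshold=0.5):
    """Keep pairs at or above the threshold, highest probability first."""
    kept = [p for p in predictions if p[2] >= threshold]
    return sorted(kept, key=lambda p: p[2], reverse=True)


# The rows shown above, used here in place of a real results.csv.
sample = io.StringIO(
    "LncRNA_ID,Protein_ID,Interaction_Probability\n"
    "lnc001,P12345,0.9421\n"
    "lnc002,Q8N6T7,0.3175\n"
    "lnc003,O76074,0.7814\n"
)
hits = filter_by_probability(load_predictions(sample))
print(hits)
# [('lnc001', 'P12345', 0.9421), ('lnc003', 'O76074', 0.7814)]
```

For a real run, replace `sample` with `open("results.csv", newline="")`.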

📜 Citation

If you use this tool in your research, please cite:

Your Name et al.
A Deep Learning Framework for lncRNA–Protein Interaction Prediction Using Transformer-Based Sequence Embeddings (2025)


🧩 Repository Structure

├── predict_interaction.py       # Main CLI script
├── README.md                    # Documentation
└── example/
    ├── lncrnas.fasta
    ├── proteins.fasta
    └── saved_model.joblib
