A CLI tool for predicting lncRNA–Protein interactions using transformer embeddings and CatBoost
🧬 lncrna-PI - LncRNA–Protein Interaction Prediction
lncrnaPI is a command-line tool for predicting lncRNA–Protein interactions using pre-trained language models (DNABERT-2 and ESM-2) for sequence embedding and a CatBoost classifier for interaction probability estimation.
🚀 Overview
This standalone script enables large-scale prediction of interactions between lncRNA and protein sequences.
It leverages state-of-the-art transformer models to extract biologically meaningful embeddings and a pre-trained CatBoost model to compute interaction probabilities.
📦 Features
- Supports FASTA input for lncRNA and protein sequences.
- Generates embeddings using:
  - 🧬 DNABERT-2 (`zhihan1996/DNABERT-2-117M`) for lncRNAs
  - 🧫 ESM-2 (`facebook/esm2_t30_150M_UR50D`) for proteins
- Predicts interaction probabilities using a CatBoost classifier.
- Supports GPU acceleration (CUDA / MPS) for faster inference.
- Outputs results in CSV format.
🧰 Installation
Install the package from the command line with pip:
pip install lncrnapi
⚙️ Usage
Run the script directly from the command line:
lncrnapi --lncrna_fasta /path/to/lncrnas.fasta --protein_fasta /path/to/proteins.fasta --model_path /path/to/saved_model.joblib --output_file /path/to/results.csv
Arguments
| Argument | Description | Required |
|---|---|---|
| `--lncrna_fasta` | Path to the FASTA file containing lncRNA sequences. | ✅ |
| `--protein_fasta` | Path to the FASTA file containing protein sequences. | ✅ |
| `--model_path` | Path to the pre-trained CatBoost model file (`.cbm`, `.joblib`, or `.pkl`). | ✅ |
| `--output_file` | Path to save the CSV file with predicted probabilities. | ✅ |
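To make the argument table concrete, here is a minimal sketch of a parser that accepts the same four flags. It is not the tool's actual source: the function name `build_parser` and the help strings are assumptions for illustration, and only the flag names come from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the four documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="lncrnapi",
        description="Predict lncRNA-protein interaction probabilities.",
    )
    parser.add_argument("--lncrna_fasta", required=True,
                        help="FASTA file containing lncRNA sequences")
    parser.add_argument("--protein_fasta", required=True,
                        help="FASTA file containing protein sequences")
    parser.add_argument("--model_path", required=True,
                        help="Pre-trained model file (.cbm, .joblib, or .pkl)")
    parser.add_argument("--output_file", required=True,
                        help="Destination CSV for predicted probabilities")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--lncrna_fasta", "lncrnas.fasta",
    "--protein_fasta", "proteins.fasta",
    "--model_path", "saved_model.joblib",
    "--output_file", "results.csv",
])
print(args.output_file)
```

Because every flag is marked `required=True`, omitting any of them makes `argparse` print a usage message and exit, which matches the "Required" column.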
🧠 How It Works
1. Model Loading: The tool loads the DNABERT-2 and ESM-2 models from Hugging Face.
2. FASTA Parsing: Extracts sequence IDs and corresponding sequences from input FASTA files.
3. Embedding Generation: Computes mean-pooled embeddings for each sequence using transformer hidden states.
4. Prediction: Concatenates embeddings (lncRNA + protein) and predicts the interaction probability using the CatBoost model.
5. Output: Generates a `.csv` file containing:
   - `LncRNA_ID`
   - `Protein_ID`
   - `Interaction_Probability`
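The embedding and prediction steps can be sketched with plain Python lists standing in for transformer hidden states. This is an illustration under stated assumptions, not the tool's implementation: `mean_pool` and `pair_features` are hypothetical names, the toy vectors are far smaller than real model outputs, and skipping padding positions via an attention mask is an assumption about how the pooling is done.

```python
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def pair_features(lnc_vec: Sequence[float],
                  prot_vec: Sequence[float]) -> List[float]:
    """Concatenate the lncRNA embedding followed by the protein embedding."""
    return list(lnc_vec) + list(prot_vec)


# Toy hidden states: 3 lncRNA tokens (the last one is padding) and 2 protein tokens.
lnc_tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
lnc_mask = [1, 1, 0]
prot_tokens = [[2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]
prot_mask = [1, 1]

features = pair_features(mean_pool(lnc_tokens, lnc_mask),
                         mean_pool(prot_tokens, prot_mask))
print(features)  # [2.0, 4.0, 3.0, 5.0, 7.0]
```

A fitted CatBoost classifier would then take one such feature row per lncRNA/protein pair; its `predict_proba` method returns class probabilities, from which the interaction probability column could be read.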
📊 Example Output
| LncRNA_ID | Protein_ID | Interaction_Probability |
|---|---|---|
| lnc001 | P12345 | 0.9421 |
| lnc002 | Q8N6T7 | 0.3175 |
⚡ Hardware Acceleration
The script automatically detects and uses available hardware:
- ✅ CUDA GPU (NVIDIA)
- ✅ MPS (Apple Silicon)
- ⚠️ CPU (fallback)
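The fallback order above can be written as a small selection function. This is a sketch under assumptions: `pick_device` is a hypothetical helper, and preferring CUDA over MPS simply follows the order of the list rather than anything confirmed in the tool's source.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Example: a machine with Apple Silicon and no NVIDIA GPU.
print(pick_device(cuda_available=False, mps_available=True))  # mps
```

With PyTorch installed, the two flags can be obtained from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.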
🧩 Model Formats Supported
| Format | Description |
|---|---|
| `.cbm` | Native CatBoost model format |
| `.joblib` | Joblib-serialized model |
| `.pkl` | Pickle-based serialized model |
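One way to handle the three formats is to dispatch on the file extension. The sketch below is an assumption about how such loading could look, not the tool's code: `load_model` is a hypothetical helper, and `catboost` and `joblib` are imported lazily so that only the branch actually taken needs its library installed.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(model_path: str):
    """Load a model by file extension (hypothetical helper, not the tool's API)."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    suffix = path.suffix.lower()
    if suffix == ".cbm":
        # Native CatBoost format: create an empty classifier, then load into it.
        from catboost import CatBoostClassifier
        model = CatBoostClassifier()
        model.load_model(str(path))
        return model
    if suffix == ".joblib":
        import joblib
        return joblib.load(path)
    if suffix == ".pkl":
        with path.open("rb") as handle:
            return pickle.load(handle)
    raise ValueError(f"Unsupported model format: {suffix or '(no extension)'}")


# Demonstrate the .pkl branch with a stand-in object instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_model.pkl"
    demo_path.write_bytes(pickle.dumps({"stand_in": "model"}))
    loaded = load_model(str(demo_path))
print(loaded)
```

Joblib and pickle files execute code when loaded, so they should only come from a trusted source.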
🛠 Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Model file not found | Wrong `--model_path` | Check the file path |
| No sequences found in FASTA | Invalid FASTA format | Ensure `>` headers are present |
| `safetensors` error | Missing library | Install with `pip install safetensors` |
| Slow performance | CPU usage | Use GPU-enabled environment |
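To check an input file before running the tool, a small parser can confirm that `>` headers are present and that each one is followed by sequence lines. This is a minimal sketch: `parse_fasta` is a hypothetical helper, and taking the first whitespace-separated token of the header as the sequence ID is an assumption about how IDs are derived.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each sequence ID to its sequence, joining wrapped sequence lines."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header with no ID")
            current_id = header[0]
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("Sequence data found before any '>' header")
        else:
            chunks[current_id].append(line)
    if not chunks:
        raise ValueError("No sequences found in FASTA")
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


example = ">lnc001 example transcript\nACGT\nACGT\n>lnc002\nTTGA\n"
print(parse_fasta(example))  # {'lnc001': 'ACGTACGT', 'lnc002': 'TTGA'}
```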
📁 Output Example
$ head results.csv
LncRNA_ID,Protein_ID,Interaction_Probability
lnc001,P12345,0.9421
lnc002,Q8N6T7,0.3175
lnc003,O76074,0.7814
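The CSV can be post-processed with the standard library alone. The sketch below filters pairs by probability, using the three rows above as inline sample data. The helper name `high_confidence_pairs` and the 0.5 cutoff are illustrative assumptions, not recommendations from the tool.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Inline copy of the sample rows so the sketch runs without a results.csv file.
SAMPLE = """LncRNA_ID,Protein_ID,Interaction_Probability
lnc001,P12345,0.9421
lnc002,Q8N6T7,0.3175
lnc003,O76074,0.7814
"""


def high_confidence_pairs(handle: TextIO,
                          threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Return (lncRNA, protein, probability) rows at or above the threshold."""
    hits = []
    for row in csv.DictReader(handle):
        probability = float(row["Interaction_Probability"])
        if probability >= threshold:
            hits.append((row["LncRNA_ID"], row["Protein_ID"], probability))
    return hits


hits = high_confidence_pairs(io.StringIO(SAMPLE), threshold=0.5)
print(hits)  # [('lnc001', 'P12345', 0.9421), ('lnc003', 'O76074', 0.7814)]
```

To read a real output file, pass `open("results.csv", newline="")` in place of the `io.StringIO` object.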
📜 Citation
If you use this tool in your research, please cite:
Your Name et al.
A Deep Learning Framework for lncRNA–Protein Interaction Prediction Using Transformer-Based Sequence Embeddings (2025)
🧩 Repository Structure
├── predict_interaction.py   # Main CLI script
├── README.md                # Documentation
└── example/
    ├── lncrnas.fasta
    ├── proteins.fasta
    └── saved_model.joblib
File details
Details for the file lncrnapi-0.1.5.tar.gz.
File metadata
- Download URL: lncrnapi-0.1.5.tar.gz
- Upload date:
- Size: 758.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2db81e4d4105b120360f556d9c255e770cbcb820fa9f68b2945dbc0728503fbc |
| MD5 | b54833fe1cc626959f42cfb7f63dc5c4 |
| BLAKE2b-256 | 07c7eba5f72cc86cbbeae98e877eacd7d07ee257cde7c883b40bfcbbfe580d22 |
File details
Details for the file lncrnapi-0.1.5-py3-none-any.whl.
File metadata
- Download URL: lncrnapi-0.1.5-py3-none-any.whl
- Upload date:
- Size: 760.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d7d4f511c301a275782d98e1f2a7120e032b3eef962e1f607469ad266e444d5 |
| MD5 | 3bed84d19aa4416c9542dff0d72d9255 |
| BLAKE2b-256 | 7081d6dec13d7393376dde7e8f6184196289444b7a00eed6e719d0622d07094d |