A CLI tool for predicting lncRNA–Protein interactions using transformer embeddings and CatBoost
🧬 lncrna-PI - LncRNA–Protein Interaction Prediction
lncrnaPI is a command-line tool for predicting lncRNA–Protein interactions using pre-trained language models (DNABERT-2 and ESM-2) for sequence embedding and a CatBoost classifier for interaction probability estimation.
It supports two modes:
- Rapid — based on sequence composition features (fast, lightweight)
- LLM — based on transformer embeddings (DNABERT2 + ESM2)
The script performs all-by-all predictions between every lncRNA and every protein sequence in the provided FASTA files.
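The all-by-all pairing described above can be sketched as follows. This is a minimal illustration, not the code in lncrnapi_cli.py: `read_fasta` and `all_by_all` are hypothetical helper names, and taking the first whitespace-delimited token of each header as the record ID is an assumption.

```python
from itertools import product


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (record_id, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            tokens = line[1:].split()
            # Assumption: the ID is the first token after ">".
            header = tokens[0] if tokens else ""
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def all_by_all(lncrnas, proteins):
    """Return every (lncRNA record, protein record) combination."""
    return list(product(lncrnas, proteins))


lnc_text = ">lnc1 example\nACGU\nACGU\n>lnc2\nGGCC\n"
prot_text = ">P12345\nMKT\n>P67890\nMVL\n>P00001\nMAA\n"
lncrnas = read_fasta(lnc_text)
proteins = read_fasta(prot_text)
pairs = all_by_all(lncrnas, proteins)
print(len(pairs))  # 2 lncRNAs x 3 proteins = 6 pairs
```

The number of pairs grows as the product of the two file sizes, so large inputs lead to correspondingly large prediction tables.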
📦 Features
- Vectorized and efficient FASTA parsing
- All-by-all pairing of lncRNA and protein sequences
- Automatic feature extraction:
  - Rapid mode: nucleotide and amino acid composition (%)
  - LLM mode: transformer-based embeddings (DNABERT2 + ESM2)
- Automatic model selection:
  - catboost_model_rapid.joblib → Composition model
  - catboost_dnabert2_esm-t30.joblib → Embedding model
- GPU-aware embedding generation (with safe fallback to CPU)
- Generates probability and binary interaction predictions
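To make the Rapid-mode features concrete, here is a rough sketch of percentage composition for one lncRNA–protein pair. The function names, the feature order, the U-to-T normalisation, and the simple concatenation of the two vectors are all assumptions for illustration. The actual feature layout expected by catboost_model_rapid.joblib may differ.

```python
NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_percent(sequence, alphabet):
    """Return the percentage of each alphabet symbol in the sequence.

    Symbols outside the alphabet (for example N or X) still count toward
    the sequence length, so the percentages may sum to less than 100.
    """
    sequence = sequence.upper()
    total = len(sequence)
    if total == 0:
        return [0.0] * len(alphabet)
    return [100.0 * sequence.count(symbol) / total for symbol in alphabet]


def pair_features(lncrna_seq, protein_seq):
    """Concatenate nucleotide and amino acid composition for one pair."""
    # Assumption: treat U as T so RNA- and DNA-style FASTA give the same features.
    lnc = composition_percent(lncrna_seq.upper().replace("U", "T"), NUCLEOTIDES)
    prot = composition_percent(protein_seq, AMINO_ACIDS)
    return lnc + prot


features = pair_features("AACGUU", "MKKL")
print(len(features))  # 4 nucleotide + 20 amino acid values = 24
```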
🧰 Dependencies
The tool was developed using Python 3.10. Install the following dependencies before running the script:
pip install torch==2.6.0 transformers==4.57.0 catboost==1.2.8 joblib tqdm numpy pandas
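After installing, a short check like the one below can confirm that the environment matches the versions pinned above. It is an optional sketch that uses only the standard library. `check_environment` is a hypothetical helper and not part of the tool.

```python
from importlib import metadata

# Versions taken from the pip command above; the other packages are unpinned there.
PINNED = {"torch": "2.6.0", "transformers": "4.57.0", "catboost": "1.2.8"}
UNPINNED = ["joblib", "tqdm", "numpy", "pandas"]


def check_environment(pinned=PINNED, unpinned=UNPINNED):
    """Return {package: status}, where status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for name in list(pinned) + list(unpinned):
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        expected = pinned.get(name)
        # Some torch wheels carry a local suffix such as "+cu124"; compare the public part only.
        if expected is not None and installed.split("+")[0] != expected:
            report[name] = f"found {installed}, expected {expected}"
        else:
            report[name] = "ok"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```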
⚙️ Usage
1️⃣ Rapid (Composition-Based) Prediction
This mode uses simple percentage-composition features and is very fast.
python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model rapid
Model used:
./model/catboost_model_rapid.joblib
2️⃣ LLM (Embedding-Based) Prediction
This mode uses transformer embeddings from DNABERT2 (for lncRNA) and ESM2-T30 (for protein).
python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model llm
Model used:
./model/catboost_dnabert2_esm-t30.joblib
Arguments
| Argument | Description | Required |
|---|---|---|
| -lf | Path to the FASTA file containing lncRNA sequences. | ✅ |
| -pf | Path to the FASTA file containing protein sequences. | ✅ |
| -wd | Path to the working directory. | ✅ |
| -model | Choice of model to be used (rapid or llm). | ✅ |
| -t | Threshold applied to the interaction probability to assign the predicted label. | ❌ |
💾 Output
A CSV file named output.csv is written to the output directory (./output in the examples above):
| lncRNA_ID | Protein_ID | Interaction_Probability | Predicted_Label |
|---|---|---|---|
| lnc1 | P12345 | 0.87 | 1 |
| lnc1 | P67890 | 0.34 | 0 |
| ... | ... | ... | ... |
- Interaction_Probability: Probability predicted by CatBoost
- Predicted_Label: 1 → interaction, 0 → non-interaction
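The relationship between the two output columns can be illustrated with a small sketch that turns probabilities into labels and writes rows in the format shown above. The helper names are hypothetical, and both the 0.5 threshold and the `>=` comparison are assumptions. The tool's own cut-off behaviour may differ.

```python
import csv
import io

FIELDNAMES = ["lncRNA_ID", "Protein_ID", "Interaction_Probability", "Predicted_Label"]


def label_predictions(rows, threshold=0.5):
    """Turn (lncRNA_ID, Protein_ID, probability) tuples into output records."""
    records = []
    for lnc_id, prot_id, prob in rows:
        records.append({
            "lncRNA_ID": lnc_id,
            "Protein_ID": prot_id,
            "Interaction_Probability": prob,
            # Assumption: probabilities at or above the threshold count as interactions.
            "Predicted_Label": 1 if prob >= threshold else 0,
        })
    return records


def write_csv(records, handle):
    """Write records to an open text handle using the documented column order."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)


records = label_predictions([("lnc1", "P12345", 0.87), ("lnc1", "P67890", 0.34)])
buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
write_csv(records, buffer)
print(buffer.getvalue())
```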
⚡ Hardware Acceleration
The script automatically detects and uses available hardware:
- ✅ CUDA GPU (NVIDIA)
- ✅ MPS (Apple Silicon)
- ⚠️ CPU (fallback)
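Device selection in the order listed above is commonly written with PyTorch along these lines. This is a sketch of the general pattern rather than the script's exact logic. `pick_device` is a hypothetical name, and returning "cpu" when torch is not installed is only there so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu", preferring accelerators when available."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: without torch, report the CPU fallback.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may lack the MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```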
🛠 Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Model file not found | Wrong --model_path | Check the file path |
| No sequences found in FASTA | Invalid FASTA format | Ensure > headers are present |
| safetensors error | Missing library | Install with pip install safetensors |
| Slow performance | CPU usage | Use GPU-enabled environment |
📜 Citation
If you use this tool in your research, please cite:
**Choudhury et al.**
Prediction of lncRNA-protein interacting pairs using LLM embeddings based on evolutionary information (2025)
🧩 Repository Structure
├── lncrnapi_cli.py # Main CLI script
├── README.md
├── LICENSE
├── test_lncrna.fa
├── test_protein.fa
├── output.csv
└── models/
    ├── catboost_model_rapid.joblib
    └── catboost_dnabert2_esm-t30.joblib
File details
Details for the file lncrnapi-1.2.tar.gz.
File metadata
- Download URL: lncrnapi-1.2.tar.gz
- Upload date:
- Size: 1.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 413669292fb0d092040ead85aa03f02da91e21510ca3b2a3420fec7c814a1953 |
| MD5 | 8f86a39a4ca23436c0ce48f57bcf9fbd |
| BLAKE2b-256 | 1e52f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103 |
File details
Details for the file lncrnapi-1.2-py3-none-any.whl.
File metadata
- Download URL: lncrnapi-1.2-py3-none-any.whl
- Upload date:
- Size: 1.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | de50404660bb4439f91ae8c50dce6cbf5b812774a7d52ccd4dd1d32876f299f4 |
| MD5 | 655ca15fdba1a8c622e3d62cc2bb2a88 |
| BLAKE2b-256 | 1245f45a6b087aca6cac649095afafc266f99a755cb825866d2a27efd42c9fa9 |