
A CLI tool for predicting lncRNA–Protein interactions using transformer embeddings and CatBoost


🧬 lncrna-PI - LncRNA–Protein Interaction Prediction

lncrnaPI is a command-line tool for predicting lncRNA–Protein interactions using pre-trained language models (DNABERT-2 and ESM-2) for sequence embedding and a CatBoost classifier for interaction probability estimation.

It supports two modes:

  • Rapid: based on sequence composition features (fast, lightweight)
  • LLM: based on transformer embeddings (DNABERT-2 + ESM-2)

The script performs all-by-all predictions between every lncRNA and every protein sequence in the provided FASTA files.
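As a rough illustration of what all-by-all pairing means, the sketch below builds every lncRNA–protein combination from two small in-memory collections. The function name `all_by_all`, the dictionary layout, and the example sequences are assumptions made for illustration; they are not taken from the tool's source.

```python
from itertools import product


def all_by_all(lncrnas, proteins):
    """Return every (lncRNA, protein) combination as a list of dicts.

    Both arguments map sequence IDs to sequences (a hypothetical layout).
    """
    return [
        {
            "lncRNA_ID": l_id,
            "Protein_ID": p_id,
            "lncRNA_seq": l_seq,
            "Protein_seq": p_seq,
        }
        for (l_id, l_seq), (p_id, p_seq) in product(lncrnas.items(), proteins.items())
    ]


# Made-up example sequences, for illustration only.
lncrnas = {"lnc1": "ACGUACGU", "lnc2": "GGCCAUAU"}
proteins = {"P12345": "MKTAYIAK", "P67890": "MSDNELLK", "Q00001": "MVLSPADK"}

pairs = all_by_all(lncrnas, proteins)
print(len(pairs))  # 2 lncRNAs x 3 proteins = 6 pairs
```

The number of pairs is the product of the two input sizes, so large FASTA files produce correspondingly large prediction tables.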


📦 Features

  • Vectorized and efficient FASTA parsing
  • All-by-all pairing of lncRNA and protein sequences
  • Automatic feature extraction:
    • Rapid mode: nucleotide and amino acid composition (%)
    • LLM mode: transformer-based embeddings (DNABERT-2 + ESM-2)
  • Automatic model selection:
    • catboost_model_rapid.joblib → Composition model
    • catboost_dnabert2_esm-t30.joblib → Embedding model
  • GPU-aware embedding generation (with safe fallback to CPU)
  • Generates probability and binary interaction predictions

🧰 Dependencies

The tool was developed using Python 3.10. Install the following dependencies before running the script:

pip install torch==2.6.0 transformers==4.57.0 catboost==1.2.8 joblib tqdm numpy pandas

⚙️ Usage

1️⃣ Rapid (Composition-Based) Prediction

This mode uses the percentage composition of each nucleotide and amino acid sequence as features, which makes it very fast.

python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model rapid

Model used:
./model/catboost_model_rapid.joblib
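
To make the idea of percentage composition concrete, here is a minimal sketch that turns one lncRNA–protein pair into a fixed-length feature vector. The exact feature set and ordering expected by the shipped rapid model are not documented here, so the alphabets, the U-to-T mapping, and the function names are assumptions; vectors built this way should not be fed to the bundled model without checking its source.

```python
NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_composition(seq, alphabet):
    """Percentage of each alphabet symbol in seq (zeros for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return [0.0] * len(alphabet)
    return [100.0 * seq.count(symbol) / len(seq) for symbol in alphabet]


def pair_features(lnc_seq, prot_seq):
    """Concatenate nucleotide (4) and amino acid (20) composition features."""
    # Assumption: RNA written with U is mapped to T so one alphabet covers both.
    lnc_seq = lnc_seq.upper().replace("U", "T")
    return percent_composition(lnc_seq, NUCLEOTIDES) + percent_composition(prot_seq, AMINO_ACIDS)


features = pair_features("ACGU", "MKMK")
print(len(features))  # 4 nucleotide + 20 amino acid percentages
```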


2️⃣ LLM (Embedding-Based) Prediction

This mode uses transformer embeddings from DNABERT-2 (for lncRNA) and ESM-2 T30 (for protein).

python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model llm

Model used:
./model/catboost_dnabert2_esm-t30.joblib


Arguments

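| Argument | Description | Required |
|---|---|---|
| -lf | Path to the FASTA file containing lncRNA sequences. | |
| -pf | Path to the FASTA file containing protein sequences. | |
| -wd | Path to the working directory. | |
| -model | Model to use: rapid or llm. | |
| -t | Probability threshold used to assign the binary Predicted_Label. | |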
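
For readers who want to see how such flags fit together, the following is a hedged `argparse` sketch that mirrors the flag names listed above. It is not the tool's actual parser: which flags are required, the `rapid`/`llm` choices being enforced by the parser, and the default threshold of 0.5 are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flag names."""
    parser = argparse.ArgumentParser(
        description="Predict lncRNA-protein interactions (illustrative sketch)."
    )
    parser.add_argument("-lf", required=True, help="FASTA file with lncRNA sequences")
    parser.add_argument("-pf", required=True, help="FASTA file with protein sequences")
    parser.add_argument("-wd", required=True, help="working directory for output")
    parser.add_argument("-model", choices=["rapid", "llm"], required=True,
                        help="prediction mode")
    # Assumption: 0.5 is used here only as a placeholder default.
    parser.add_argument("-t", type=float, default=0.5,
                        help="probability threshold for the binary label")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["-lf", "lnc.fasta", "-pf", "prot.fasta", "-wd", "./output", "-model", "rapid"]
)
print(args.model, args.t)
```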

💾 Output

A CSV file named output.csv is generated in the output directory:

| lncRNA_ID | Protein_ID | Interaction_Probability | Predicted_Label |
|---|---|---|---|
| lnc1 | P12345 | 0.87 | 1 |
| lnc1 | P67890 | 0.34 | 0 |
| ... | ... | ... | ... |

  • Interaction_Probability: Probability predicted by CatBoost
  • Predicted_Label: 1 → interaction, 0 → non-interaction
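
The relationship between the two output columns can be sketched as a simple thresholding step followed by a CSV write. The threshold value of 0.5, the use of `>=` rather than `>`, and the helper names are assumptions; the column names follow the example table above.

```python
import csv
import io

HEADER = ["lncRNA_ID", "Protein_ID", "Interaction_Probability", "Predicted_Label"]


def label_predictions(rows, threshold=0.5):
    """Append a 0/1 label to each (lncRNA_ID, Protein_ID, probability) row."""
    # Assumption: a probability equal to the threshold counts as an interaction.
    return [(l_id, p_id, prob, int(prob >= threshold)) for l_id, p_id, prob in rows]


def write_csv(labelled, handle):
    """Write the labelled rows with a header to any text file handle."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    writer.writerows(labelled)


rows = [("lnc1", "P12345", 0.87), ("lnc1", "P67890", 0.34)]
labelled = label_predictions(rows)

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
write_csv(labelled, buffer)
print(buffer.getvalue())
```

Raising the threshold makes the binary label stricter without changing the probabilities themselves.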

⚡ Hardware Acceleration

The script automatically detects and uses available hardware:

  • CUDA GPU (NVIDIA)
  • MPS (Apple Silicon)
  • ⚠️ CPU (fallback)
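
A device check in this order of preference could look like the sketch below, which uses PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. The function name `pick_device` and the choice to return `"cpu"` when PyTorch is not installed are assumptions for illustration, not the script's own logic.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch, report CPU so the example still runs.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```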

🛠 Troubleshooting

| Issue | Possible Cause | Solution |
|---|---|---|
| Model file not found | Wrong --model_path | Check the file path |
| No sequences found in FASTA | Invalid FASTA format | Ensure > headers are present |
| safetensors error | Missing library | Install with pip install safetensors |
| Slow performance | CPU usage | Use a GPU-enabled environment |
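
To check an input file before running a prediction, a small stand-alone FASTA reader such as the one below can confirm that `>` header lines are present. This is an illustrative sketch, not the tool's parser; the function name, the ID rule (first whitespace-delimited token of the header), and the error message are assumptions.

```python
import os
import tempfile


def read_fasta(path):
    """Read a FASTA file into {sequence_id: sequence}; raise if no headers exist."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                tokens = line[1:].split()
                # Fall back to a positional name if the header is empty.
                current = tokens[0] if tokens else f"seq{len(records) + 1}"
                records[current] = []
            elif current is not None:
                records[current].append(line)
    if not records:
        raise ValueError(f"No sequences found in {path}: no '>' header lines")
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.fa")
    with open(good, "w") as fh:
        fh.write(">lnc1 example record\nACGU\nACGU\n")
    sequences = read_fasta(good)

    bad = os.path.join(tmp, "bad.fa")
    with open(bad, "w") as fh:
        fh.write("ACGUACGU\n")  # sequence data without a header line
    try:
        read_fasta(bad)
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

print(sequences)
print(error_message)
```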

📜 Citation

If you use this tool in your research, please cite:

Choudhury et al. (2025). Prediction of lncRNA-protein interacting pairs using LLM embeddings based on evolutionary information.


🧩 Repository Structure

├── lncrnapi_cli.py       # Main CLI script
├── README.md
├── LICENSE
├── test_lncrna.fa
├── test_protein.fa
├── output.csv 
└── models/
    ├── catboost_model_rapid.joblib
    └── catboost_dnabert2_esm-t30.joblib                

