
A custom PLiCat model for lipid-binding protein prediction



license: mit
language:
  • en
base_model:
  • EvolutionaryScale/esmc-300m-2024-12
  • google-bert/bert-base-uncased
new_version: Noora68/PLiCat-0.4B
tags:
  • biology
  • protein
  • protein classification
  • lipid binding
  • lipid binding site
  • recognition


PLiCat (Protein–Lipid interaction Categorization tool)

We present PLiCat (Protein–Lipid interaction Categorization tool), a robust tool for predicting the lipid categories that interact with a protein, using the protein sequence as the only input. Its architecture fuses the ESM C and BERT models, enabling accurate and interpretable predictions that distinguish lipid-binding signatures among the 8 major lipid categories defined by LIPID MAPS. PLiCat is intended as a tool to facilitate the exploration of lipid-binding specificity and rational protein design.



Model Details

  • Architecture: ESM Cambrian + BERT + classification head
  • Task: Multi-label protein-lipid binding prediction
  • Fine-tuned from: ESMC_300m + bert-base-uncased
  • Developed by: Noora68
  • Framework: PyTorch + HuggingFace Transformers
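
Because the task is multi-label, each category is scored on its own rather than competing with the others. The following is a minimal sketch of that idea in plain Python; the logit values are hypothetical and chosen only for illustration, and the `sigmoid` helper is not part of the PLiCat package.

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for three categories (illustrative values only).
logits = [2.0, 0.5, -3.0]

# Multi-label scoring: every logit is converted independently,
# so the probabilities are not forced to sum to 1.
probs = [sigmoid(z) for z in logits]
total = sum(probs)

for z, p in zip(logits, probs):
    print(f"logit {z:+.1f} -> probability {p:.4f}")
print(f"sum of probabilities: {total:.4f}")
```

A protein can therefore receive a high score for several lipid categories at once, or for none of them.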

Model usage workflow:

  1. Load the model and tokenizer
  2. Process the input sequence (tokenize → batch → pad → mask)
  3. Run inference to obtain logits → probabilities
  4. Output the results and mark high-confidence categories
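
Step 2 matters most when several sequences of different lengths are processed together. The sketch below illustrates the pad and mask logic in plain Python, without PyTorch. The token ids and the `PAD_ID` value are placeholders (in practice the pad id would come from `tokenizer.pad_token_id`), and `pad_batch` is a hypothetical helper, not a function of the package.

```python
from typing import List, Tuple

# Assumption: placeholder pad id for illustration only.
PAD_ID = 1


def pad_batch(
    token_lists: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every token list to the longest one and build the mask.

    The mask holds 1 for real tokens and 0 for padding positions.
    """
    max_len = max(len(tokens) for tokens in token_lists)
    padded = [t + [pad_id] * (max_len - len(t)) for t in token_lists]
    mask = [[1] * len(t) + [0] * (max_len - len(t)) for t in token_lists]
    return padded, mask


# Two hypothetical tokenized sequences of different lengths.
batch = [[0, 20, 13, 8, 2], [0, 20, 2]]
padded, mask = pad_batch(batch)

print(padded)
print(mask)
```

Here the mask is built from the original lengths. Comparing the padded ids against the pad id gives the same result as long as no real token shares the pad id.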

Usage

from plicat_model import PLiCat
import torch
from torch.nn.utils.rnn import pad_sequence
from esm.tokenization import EsmSequenceTokenizer

# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = EsmSequenceTokenizer()

# Default lipid type dictionary
default_dict = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)"
}

# Load pretrained PLiCat model
model = PLiCat.from_pretrained("Noora68/PLiCat-0.4B").to(device)

# Example protein sequence
sequence = "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"

# Tokenize the sequence -> input_ids
input_ids = torch.tensor(tokenizer.encode(sequence))

# Add batch dimension: (batch_size=1, length)
input_ids = input_ids.unsqueeze(0)

# Pad to the longest sequence in the batch
input_ids_padded = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)

# Build attention mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long()

# Move tensors to the same device as model
input_ids_padded = input_ids_padded.to(device)
attention_mask = attention_mask.to(device)

# Forward pass (no gradient needed during inference)
with torch.no_grad():
    outputs = model(input_ids_padded, attention_mask)

# Convert logits to probabilities using sigmoid
probs = torch.sigmoid(outputs['logits'])

# Convert to CPU and numpy array
probs = probs.squeeze().detach().cpu().numpy()

# Print results: add a check mark if probability > 0.6
for i, p in enumerate(probs):
    mark = " √" if p > 0.6 else ""
    print(f"{default_dict[str(i)]:<25}: {p:.4f}{mark}")

The output of the above example is:

NotLipidType             : 0.0007
Fatty Acyl (FA)          : 0.1092
Prenol Lipid (PR)        : 0.9178 √
Glycerophospholipid (GP) : 0.6059 √
Sterol Lipid (ST)        : 0.0083
Polyketide (PK)          : 0.0026
Glycerolipid (GL)        : 0.0771
Sphingolipid (SP)        : 0.0002
Saccharolipid (SL)       : 0.0000
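
To use these scores downstream, it can help to collect the categories above the threshold into a list. The sketch below does this for the probabilities printed above; `select_categories` is a hypothetical helper, and 0.6 is simply the cut-off used for the check mark in the usage example, not a calibrated value.

```python
from typing import Dict, List


def select_categories(probs: Dict[str, float], threshold: float = 0.6) -> List[str]:
    """Return the category names whose probability exceeds the threshold."""
    return [name for name, p in probs.items() if p > threshold]


# Probabilities copied from the example output above.
example_probs = {
    "NotLipidType": 0.0007,
    "Fatty Acyl (FA)": 0.1092,
    "Prenol Lipid (PR)": 0.9178,
    "Glycerophospholipid (GP)": 0.6059,
    "Sterol Lipid (ST)": 0.0083,
    "Polyketide (PK)": 0.0026,
    "Glycerolipid (GL)": 0.0771,
    "Sphingolipid (SP)": 0.0002,
    "Saccharolipid (SL)": 0.0000,
}

selected = select_categories(example_probs)
print(selected)  # ['Prenol Lipid (PR)', 'Glycerophospholipid (GP)']
```

Note that Glycerophospholipid (GP) passes only narrowly at 0.6059, so a slightly stricter threshold would drop it.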

Limitations

  • Trained only on lipid-binding protein data and may not generalize to other functions.
  • Model performance is best with sequence lengths under 500.
  • Dataset size is limited compared to large-scale protein corpora.
  • Model may reflect biases present in training data (e.g., under-representation of certain lipid types).
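
The length limitation can be checked before inference. The following is a minimal sketch, assuming the figure of 500 refers to amino-acid residues (the source does not state the unit); `check_length` is a hypothetical helper, not part of the package.

```python
import warnings

# Taken from the limitation above; the unit (residues) is an assumption.
MAX_RECOMMENDED_LENGTH = 500


def check_length(sequence: str, max_len: int = MAX_RECOMMENDED_LENGTH) -> bool:
    """Return True if the sequence is shorter than max_len.

    Otherwise emit a warning and return False; the sequence is not modified.
    """
    if len(sequence) < max_len:
        return True
    warnings.warn(
        f"Sequence length {len(sequence)} is not under {max_len}; "
        "predictions may be less reliable."
    )
    return False


short_ok = check_length("MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM")
print(short_ok)
```

The helper only warns. Whether to truncate, split, or skip a long sequence is left to the caller.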

Citation

If you use this model, please cite:

@article{your2025paper,
  title={Deciphering the code of lipid binding by large language model},
  author={Feitong Dong,},
  journal={Bioinformatics},
  year={2025}
}

License

MIT License


