ncRNA language model
ncRNABert: Deciphering the landscape of non-coding RNA using language model
Model details
| Model | Parameters | Hidden size | Pretraining dataset | # of ncRNAs | Model download |
|---|---|---|---|---|---|
| ncRNABert | 303M | 1024 | RNAcentral | 26M | Download |
Install
As a prerequisite, you must have PyTorch installed to use this repository.
Install either the latest development version from GitHub or the stable release from PyPI:
# latest version
pip install git+https://github.com/wangleiofficial/ncRNABert
# stable version
pip install ncRNABert
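To confirm that the prerequisite and the package are both visible to the current interpreter, a small check along these lines may help. It is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names `torch` and `ncRNABert` are taken from the install commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the state of the prerequisite (PyTorch) and of the package itself.
for name in ("torch", "ncRNABert"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```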
Usage
ncRNA sequence embedding
from ncRNABert.pretrain import load_ncRNABert
from ncRNABert.utils import BatchConverter
import torch

# Each entry is a (name, sequence) pair
data = [
    ("ncRNA1", "ACGGAGGATGCGAGCGTTATCCGGATTTACTGGGCG"),
    ("ncRNA2", "AGGTTTTTAATCTAATTAAGATAGTTGA"),
]

# Convert the batch to token tensors and load the pretrained model
ids, batch_token, lengths = BatchConverter(data)
model = load_ncRNABert()

# Inference only, so gradients are not needed
with torch.no_grad():
    results = model(batch_token, lengths, repr_layers=[24])

# Generate per-sequence representations via averaging
token_representations = results["representations"][24]
sequence_representations = []
batch_lens = [len(item[1]) for item in data]
for i, tokens_len in enumerate(batch_lens):
    # Note: this averages over every position in row i, including any
    # padding or special tokens; tokens_len is available for slicing first.
    sequence_representations.append(token_representations[i].mean(0))
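When sequences in a batch differ in length, averaging over the full padded row mixes padding positions into the result. The sketch below shows one way to average over only the first `tokens_len` positions of each row. It runs on a random stand-in tensor rather than the real model output, and both the helper name `mean_pool` and its `offset` parameter are hypothetical: whether ncRNABert prepends a special token (which would call for `offset=1`) is not stated here, so the default assumes none.

```python
from typing import Sequence

import torch


def mean_pool(token_reprs: torch.Tensor, lengths: Sequence[int], offset: int = 0) -> torch.Tensor:
    """Average each row over its own valid positions only.

    token_reprs: (batch, positions, hidden) tensor of per-token embeddings.
    lengths:     number of valid positions per row.
    offset:      index of the first valid position (assumption: 0, i.e. no
                 leading special token).
    """
    pooled = []
    for i, n in enumerate(lengths):
        pooled.append(token_reprs[i, offset : offset + n].mean(dim=0))
    return torch.stack(pooled)


# Stand-in for results["representations"][24]: 2 sequences padded to 36 positions.
torch.manual_seed(0)
fake_reprs = torch.randn(2, 36, 1024)
seq_lens = [36, 28]  # lengths of the two example sequences

pooled = mean_pool(fake_reprs, seq_lens)
print(pooled.shape)  # one 1024-dimensional vector per sequence
```

With the real model, `fake_reprs` would be replaced by `token_representations` and `seq_lens` by `batch_lens` from the example above.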
Comprehensive benchmarking of Large Language Models
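Compared with other RNA language models, ncRNABert is competitive with the strongest baselines. In the first table it achieves the highest average F1 score across the nine RNA families (0.595), narrowly ahead of RiNALMo (0.584) and ERNIE-RNA (0.583), although those two models score higher on several individual families (for example 5s, RNaseP, and tRNA). In the second table, ERNIE-RNA scores highest on both bpRNA (0.628) and bpRNA-new (0.601), while ncRNABert scores 0.595 and 0.572 respectively.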
| Methods | 16s | 23s | 5s | RNaseP | grp1 | srp | tRNA | telomerase | tmRNA | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| ERNIE-RNA | 0.539 | 0.580 | 0.820 | 0.687 | 0.317 | 0.610 | 0.841 | 0.151 | 0.700 | 0.583 |
| RNA-FM | 0.152 | 0.193 | 0.555 | 0.324 | 0.136 | 0.277 | 0.763 | 0.121 | 0.293 | 0.313 |
| RNA-MSM | 0.133 | 0.223 | 0.264 | 0.207 | 0.189 | 0.151 | 0.338 | 0.072 | 0.240 | 0.202 |
| RNABERT | 0.144 | 0.167 | 0.211 | 0.171 | 0.144 | 0.152 | 0.458 | 0.101 | 0.152 | 0.189 |
| RNAErnie | 0.191 | 0.227 | 0.536 | 0.198 | 0.170 | 0.164 | 0.795 | 0.071 | 0.259 | 0.290 |
| RiNALMo | 0.473 | 0.596 | 0.796 | 0.667 | 0.566 | 0.548 | 0.845 | 0.093 | 0.669 | 0.584 |
| one-hot | 0.155 | 0.188 | 0.279 | 0.169 | 0.149 | 0.174 | 0.452 | 0.132 | 0.175 | 0.208 |
| ncRNABert | 0.573 | 0.733 | 0.773 | 0.629 | 0.423 | 0.589 | 0.789 | 0.161 | 0.688 | 0.595 |
| Methods | bpRNA | bpRNA-new |
|---|---|---|
| ERNIE-RNA | 0.628 | 0.601 |
| RNA-FM | 0.522 | 0.423 |
| RNA-MSM | 0.426 | 0.393 |
| RNABERT | 0.357 | 0.358 |
| RNAErnie | 0.442 | 0.387 |
| RiNALMo | 0.599 | 0.446 |
| one-hot | 0.351 | 0.383 |
| ncRNABert | 0.595 | 0.572 |
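The Average column of the first table is the unweighted mean of the nine per-family scores, which can be checked directly. The snippet below is a small sketch that recomputes it for the three highest-scoring models, with the values copied from the table; the variable names are illustrative only.

```python
from statistics import mean

# Per-family scores copied from the first table, in column order:
# 16s, 23s, 5s, RNaseP, grp1, srp, tRNA, telomerase, tmRNA
scores_by_model = {
    "ERNIE-RNA": [0.539, 0.580, 0.820, 0.687, 0.317, 0.610, 0.841, 0.151, 0.700],
    "RiNALMo": [0.473, 0.596, 0.796, 0.667, 0.566, 0.548, 0.845, 0.093, 0.669],
    "ncRNABert": [0.573, 0.733, 0.773, 0.629, 0.423, 0.589, 0.789, 0.161, 0.688],
}

# Unweighted mean over families, rounded to three decimals as in the table.
averages = {model: round(mean(scores), 3) for model, scores in scores_by_model.items()}
best_model = max(averages, key=averages.get)

print(averages)    # {'ERNIE-RNA': 0.583, 'RiNALMo': 0.584, 'ncRNABert': 0.595}
print(best_model)  # ncRNABert
```

Because the mean is unweighted, a large gain on one family (such as 23s) can offset lower scores on others, which is why the per-family columns are worth reading alongside the average.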
License
This source code is licensed under the Apache-2.0 license found in the LICENSE file in the root directory of this source tree.
File details
Details for the file ncrnabert-0.1.4.tar.gz.
File metadata
- Download URL: ncrnabert-0.1.4.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d998442dc72fa2826c98b172303de3353f7eb27ba1ade56e8296eceba8604cdc |
| MD5 | 5098996b914ed3a154702da5c66adaaa |
| BLAKE2b-256 | a432bf5ef4a4234625c357b6b3041eadcba0371c9cb4dec28f6012b32cdeaea6 |
File details
Details for the file ncrnabert-0.1.4-py3-none-any.whl.
File metadata
- Download URL: ncrnabert-0.1.4-py3-none-any.whl
- Upload date:
- Size: 11.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d621fd9890b4e1352ee40fb2ae0c062a5348208b0f0dc57fb9b40b4037057dce |
| MD5 | e44442f3f9443c0f6720bb86df67c02b |
| BLAKE2b-256 | 299022088aeef9dca7e0abbc358c8b2eb230e830213fd1e03222042997b5102a |
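A downloaded file can be checked against the SHA256 digests listed above before installing it. The following is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `matches_digest` are hypothetical, and the demo at the end assumes the source archive was saved to the current directory under its original name.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Digest of the source distribution, copied from the table above.
EXPECTED_SDIST_SHA256 = "d998442dc72fa2826c98b172303de3353f7eb27ba1ade56e8296eceba8604cdc"

archive = Path("ncrnabert-0.1.4.tar.gz")  # assumed download location
if archive.exists():
    print("digest OK" if matches_digest(archive, EXPECTED_SDIST_SHA256) else "digest MISMATCH")
```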