ProtFlash: A lightweight protein language model
Install
As a prerequisite, you must have PyTorch installed to use this repository.
Install with a single pip command, choosing either the latest development version from GitHub or the stable release from PyPI:
# latest version
pip install git+https://github.com/isyslab-hust/ProtFlash
# stable version
pip install ProtFlash
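A quick way to verify the installation is to import the package alongside its PyTorch dependency (a minimal sanity check; it assumes the module layout used in the usage examples below):

import torch
from ProtFlash.pretrain import load_prot_flash_base

print(torch.__version__)  # PyTorch must already be installed for ProtFlash to work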
Model details
Model | # of parameters | Hidden size | Pretraining dataset | # of proteins | Model download
---|---|---|---|---|---
ProtFlash-base | 174M | 768 | UniRef100 | 51M | ProtFlash-base
ProtFlash-small | 79M | 512 | UniRef50 | 51M | ProtFlash-small
Usage
Protein sequence embedding
import torch

from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
ids, batch_token, lengths = batchConverter(data)
model = load_prot_flash_base()
with torch.no_grad():
    token_embedding = model(batch_token, lengths)

# Generate per-sequence representations by averaging over each sequence's tokens
sequence_representations = []
for i, (_, seq) in enumerate(data):
    sequence_representations.append(token_embedding[i, 0: len(seq) + 1].mean(0))
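Each entry in sequence_representations is a single hidden-size vector (768-dimensional for ProtFlash-base, per the table above). To hand them to downstream code in one batch, they can be stacked into a tensor, as in this short sketch:

# Stack the per-sequence mean embeddings into a (num_sequences, hidden_size) tensor
embeddings = torch.stack(sequence_representations)
print(embeddings.shape)  # e.g. torch.Size([2, 768]) for ProtFlash-base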
Loading weight files
import torch
from ProtFlash.model import FLASHTransformer

model_data = torch.load(your_parameters_file)  # path to a downloaded checkpoint
hyper_parameter = model_data["hyper_parameters"]
model = FLASHTransformer(hyper_parameter['dim'], hyper_parameter['num_tokens'],
                         hyper_parameter['num_layers'], group_size=hyper_parameter['num_tokens'],
                         query_key_dim=hyper_parameter['qk_dim'], max_rel_dist=hyper_parameter['max_rel_dist'],
                         expansion_factor=hyper_parameter['expansion_factor'])
model.load_state_dict(model_data['state_dict'])
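After load_state_dict, the manually constructed model can be driven with the same batchConverter pipeline as the pretrained loader above (a sketch reusing the embedding example; your_parameters_file remains a placeholder for your checkpoint path):

from ProtFlash.utils import batchConverter

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
ids, batch_token, lengths = batchConverter(data)

model.eval()  # disable dropout so embeddings are deterministic
with torch.no_grad():
    token_embedding = model(batch_token, lengths)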
License
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
Citation
If you use this code or one of our pretrained models in your publication, please cite our paper:
Lei Wang, Hui Zhang, Wei Xu, Zhidong Xue, and Yan Wang. ProtFlash: Deciphering the protein landscape with a novel and lightweight language model, Under revision (2023)