ProLM model utilities
ProLM
Installation
pip install prolm
Tokenizer usage
1. Tokenize
The input is a string of the form <tag>content</tag>...<tag>content</tag>, where tag is one of aas (amino acid sequence), cds (codons), or ncds (non-coding nucleic acid).
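Tagged input strings can also be assembled programmatically. The helper below is a minimal sketch: `build_tagged_input` and `VALID_TAGS` are hypothetical names introduced here for illustration and are not part of prolm.

```python
# Hypothetical helper, not part of prolm: wraps each segment in its tag.
VALID_TAGS = ("aas", "cds", "ncds")


def build_tagged_input(segments):
    """Join (tag, content) pairs into one tagged input string."""
    parts = []
    for tag, content in segments:
        if tag not in VALID_TAGS:
            raise ValueError(f"unknown tag: {tag!r}")
        parts.append(f"<{tag}>{content}</{tag}>")
    return "".join(parts)


complex_sequence = build_tagged_input(
    [("cds", "ATCGCT"), ("ncds", "atcg"), ("aas", "MAV")]
)
print(complex_sequence)
# <cds>ATCGCT</cds><ncds>atcg</ncds><aas>MAV</aas>
```

Validating the tag up front means a typo fails early instead of producing a string the tokenizer may not recognize.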
from prolm.prolm_tokenizer import ProLMTokenizer
tokenizer = ProLMTokenizer()
# Tokenize a protein
protein_sequence = "<aas>MAVFGHVLNM</aas>"
print(tokenizer.tokenize(protein_sequence))
# ['<cls>', 'M', 'A', 'V', 'F', 'G', 'H', 'V', 'L', 'N', 'M', '<sep>']
# Tokenize codons
cds_sequence = "<cds>ACGCGTACG</cds>"
print(tokenizer.tokenize(cds_sequence))
# ['<cls>', 'acg', 'cgt', 'acg', '<sep>']
# Tokenize a non-coding nucleic acid
ncds_sequence = "<ncds>ACGCGTACG</ncds>"
print(tokenizer.tokenize(ncds_sequence))
# ['<cls>', 'a', 'c', 'g', 'c', 'g', 't', 'a', 'c', 'g', '<sep>']
# Tokenize a complex (several tagged segments in one string)
complex_sequence = "<cds>ATCGCT</cds><ncds>atcg</ncds><aas>MAV</aas>"
print(tokenizer.tokenize(complex_sequence))
# ['<cls>', 'atc', 'gct', '<sep>', 'a', 't', 'c', 'g', '<sep>', 'M', 'A', 'V', '<sep>']
# Do not add <cls> and <sep>
ncds_sequence = "<ncds>ACGCGTACG</ncds>"
print(tokenizer.tokenize(ncds_sequence, special_add=None))
# ['a', 'c', 'g', 'c', 'g', 't', 'a', 'c', 'g']
# Add <bos> and <eos> (decoder only)
ncds_sequence = "<ncds>ACGCGTACG</ncds>"
print(tokenizer.tokenize(ncds_sequence, special_add="decoder"))
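The splitting rules implied by the examples above can be sketched in plain Python: amino acids become one token per residue, cds segments are lowercased and split into three-base codons, and ncds segments are lowercased and split per base. This is an illustrative re-implementation, not the prolm source. `sketch_tokenize`, its `add_special` flag, and the handling of a cds whose length is not a multiple of three are assumptions.

```python
import re

# Illustrative sketch only; this is NOT how prolm is implemented internally.
SEGMENT_RE = re.compile(r"<(aas|cds|ncds)>(.*?)</\1>")


def sketch_tokenize(text, add_special=True):
    """Split a tagged string following the rules shown in the examples."""
    tokens = ["<cls>"] if add_special else []
    for tag, content in SEGMENT_RE.findall(text):
        if tag == "aas":
            tokens.extend(content)  # one token per residue, case preserved
        elif tag == "cds":
            seq = content.lower()
            # Assumption: trailing bases that do not fill a codon are dropped.
            usable = len(seq) - len(seq) % 3
            tokens.extend(seq[i:i + 3] for i in range(0, usable, 3))
        else:  # ncds
            tokens.extend(content.lower())  # one token per base
        if add_special:
            tokens.append("<sep>")  # each segment is closed by <sep>
    return tokens


print(sketch_tokenize("<cds>ATCGCT</cds><ncds>atcg</ncds><aas>MAV</aas>"))
# ['<cls>', 'atc', 'gct', '<sep>', 'a', 't', 'c', 'g', '<sep>', 'M', 'A', 'V', '<sep>']
```

The sketch reproduces the documented token lists for the protein, codon, non-coding, and complex examples. It does not model the decoder mode.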
2. Tokenize and convert to tensors (the input is a list)
ncds_sequence = ["<ncds>ACGCGTACG</ncds>", ]
print(tokenizer(ncds_sequence, special_add="decoder"))
# {'input_ids': tensor([[ 4, 93, 94, 95, 94, 95, 96, 93, 94, 95, 5]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'length': tensor([11]), 'special_tokens_mask': tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])}
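The fields of the returned dictionary relate to one another in a simple way, which the sketch below reproduces with plain lists using the values from the example output. It assumes that ids 4 and 5 are the `<bos>` and `<eos>` tokens, inferred from the positions flagged in `special_tokens_mask`; the actual vocabulary is not shown here. `SPECIAL_IDS` and `sequence_ids` are names introduced for illustration.

```python
# Values copied from the example output above.
input_ids = [4, 93, 94, 95, 94, 95, 96, 93, 94, 95, 5]

# Assumption: 4 and 5 are the <bos>/<eos> ids (inferred, not documented).
SPECIAL_IDS = {4, 5}

# 1 marks a special token, 0 marks a sequence token.
special_tokens_mask = [1 if i in SPECIAL_IDS else 0 for i in input_ids]

# With a single sequence there is no padding, so every position is attended.
attention_mask = [1] * len(input_ids)
length = len(input_ids)

# Keep only the sequence tokens, e.g. before pooling per-token embeddings.
sequence_ids = [i for i, m in zip(input_ids, special_tokens_mask) if m == 0]

print(special_tokens_mask)  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(length)               # 11
print(sequence_ids)         # [93, 94, 95, 94, 95, 96, 93, 94, 95]
```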
3. Model update (0.0.4)
File details
Details for the file prolm-0.0.8-py3-none-any.whl.
File metadata
- Download URL: prolm-0.0.8-py3-none-any.whl
- Upload date:
- Size: 27.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9ec7ddc60783a6583b34c1158f09c6ec05203d22c39dd0e1998b7ff19f5ad34c |
| MD5 | cbc03b53a79ab9b6a5ffdbd8d7e3f065 |
| BLAKE2b-256 | 322585dafa04fbc2de7a7fc9c8658b3ddda484528797dc2eeea3bd291b4222ee |