# rs-bytepiece

Python binding for bytepiece-rs.
## Install

```shell
pip install rs_bytepiece
```
## Usage

```python
from rs_bytepiece import Tokenizer

# default model
tokenizer = Tokenizer()
# or a custom model
tokenizer = Tokenizer("/path/to/model")

ids = tokenizer.encode("今天天气不错")
text = tokenizer.decode(ids)
```
## Performance

Encoding is noticeably faster than the original implementation. I benchmarked it (on my M2, 16 GB) against《鲁迅全集》(the Complete Works of Lu Xun), which contains 625,890 characters. All times are in milliseconds.
| length | jieba    | aho_py  | aho_cy | aho_rs |
|--------|----------|---------|--------|--------|
| 100    | 17062.12 | 1404.37 | 564.31 | 112.94 |
| 1000   | 17104.38 | 1424.6  | 573.32 | 113.18 |
| 10000  | 17432.58 | 1429.0  | 574.93 | 110.03 |
| 100000 | 17228.17 | 1401.01 | 574.5  | 110.44 |
| 625890 | 17305.95 | 1419.79 | 567.78 | 108.54 |
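The timings above can be reproduced with a small harness along these lines. This is a sketch: the `Tokenizer` construction shown in the comment follows the Usage section, while the stand-in `encode` function and the repeated sample text are placeholders so the harness runs on its own.

```python
import time


def bench(encode, text, repeats=5):
    """Return the best wall-clock time in milliseconds to encode `text`
    across `repeats` runs (best-of-N reduces scheduling noise)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


if __name__ == "__main__":
    # Stand-in for a real tokenizer; to measure the aho_rs column, swap in:
    #   from rs_bytepiece import Tokenizer
    #   encode = Tokenizer("/path/to/model").encode
    encode = lambda s: s.encode("utf-8")
    for length in (100, 1000, 10000):
        # Repeat a short Chinese phrase to build a sample of the target length.
        sample = ("今天天气不错" * (length // 6 + 1))[:length]
        print(f"{length}: {bench(encode, sample):.2f} ms")
```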
## Download files
### Source Distribution

rs_bytepiece-0.2.1.tar.gz (1.2 MB)
### Built Distributions
#### Hashes for rs_bytepiece-0.2.1-cp37-abi3-win_amd64.whl

| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | a1cdb9cf73096417dbd705d31223c8a0276f3cd06b78893f6a4ff3814480aa00 |
| MD5         | 7f15057c87c6bd840a580785913a83cf |
| BLAKE2b-256 | 2c26a93bae87c966342d873a2f1f5baf06bb40819723190615380c9684fee267 |
#### Hashes for rs_bytepiece-0.2.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | 662f017b41eabe12d3763458bac671f0fd3e3e6ac17bba95faa7aea527ba7cb0 |
| MD5         | 55e12e4c5e6bf5e9f7b5f7b7edc24853 |
| BLAKE2b-256 | 1733a690074407673267e7b30104afbcd906cbf359d2c9401cf7fb969caaad70 |
#### Hashes for rs_bytepiece-0.2.1-cp37-abi3-macosx_10_7_x86_64.whl

| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | aecd2f84552062d665c6c4b26aada2b617c5fe7a46fbed3177c54c4a15d08b91 |
| MD5         | 6652d01087c01ef3964e23e2321ad629 |
| BLAKE2b-256 | a1ec6442e93945f67d7bee7a61ee55cd9884f48d7ca50accb59f43e944d6f100 |