rs-bytepiece
Python bindings for bytepiece-rs.
Install
pip install rs_bytepiece
Usage
from rs_bytepiece import Tokenizer
# default model
tokenizer = Tokenizer()
# or a custom model
tokenizer = Tokenizer("/path/to/model")
ids = tokenizer.encode("今天天气不错")  # encode text to a list of token ids
text = tokenizer.decode(ids)  # decode token ids back to text
Performance
This binding (aho_rs) is considerably faster than the original implementation (aho_py / aho_cy). I tested it on my M2 (16 GB) with《鲁迅全集》(the Complete Works of Lu Xun), which has 625,890 characters. Times are in milliseconds.
length | jieba (ms) | aho_py (ms) | aho_cy (ms) | aho_rs (ms)
---|---|---|---|---
100 | 17062.12 | 1404.37 | 564.31 | 112.94
1000 | 17104.38 | 1424.60 | 573.32 | 113.18
10000 | 17432.58 | 1429.00 | 574.93 | 110.03
100000 | 17228.17 | 1401.01 | 574.50 | 110.44
625890 | 17305.95 | 1419.79 | 567.78 | 108.54
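For reference, a minimal sketch of how such a timing could be reproduced with the API shown above. The corpus path and the chunking scheme are assumptions for illustration, not the original benchmark script.

import time
from rs_bytepiece import Tokenizer

tokenizer = Tokenizer()  # or Tokenizer("/path/to/model")

# "corpus.txt" is a placeholder for the benchmark text (e.g. 《鲁迅全集》)
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

for length in (100, 1000, 10000, 100000, len(text)):
    # split the corpus into chunks of the given length and encode each one
    chunks = [text[i:i + length] for i in range(0, len(text), length)]
    start = time.perf_counter()
    for chunk in chunks:
        tokenizer.encode(chunk)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"length={length}: {elapsed_ms:.2f} ms")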