RWKV Tokenizer (WIP)
A fast RWKV tokenizer written in Rust. This is my very first Rust program, so there are still many things to improve and fix :-)
Installation
First install Cargo (Rust's build tool and package manager) if you don't already have it. Cargo will no longer be needed once I publish the wheel (whl) package on PyPI.
$ sudo apt install cargo
Then install the rwkv-tokenizer python module:
$ pip install rwkv-tokenizer
Usage
>>> import rwkv_tokenizer
>>> tokenizer = rwkv_tokenizer.Tokenizer("./rwkv_vocab_v20230424.txt")
>>> tokenizer.encode("Today is a beautiful day. 今天是美好的一天。")
[33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]
>>> tokenizer.decode([33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080])
'Today is a beautiful day. 今天是美好的一天。'
Bugs
There are still bugs where some characters are not encoded correctly.
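Because some characters may still be encoded incorrectly, a round-trip check can help spot affected inputs before relying on the output. The following is a minimal sketch that assumes only the `encode` and `decode` methods shown in Usage; `find_roundtrip_failures` is a hypothetical helper written for illustration and is not part of the package.

```python
def find_roundtrip_failures(tokenizer, samples):
    """Return (original, restored) pairs where decode(encode(text)) != text.

    `tokenizer` can be any object that exposes encode(str) -> list of ints
    and decode(list of ints) -> str, as in the Usage section.
    """
    failures = []
    for text in samples:
        token_ids = tokenizer.encode(text)
        restored = tokenizer.decode(token_ids)
        if restored != text:
            failures.append((text, restored))
    return failures


# Example use with the real tokenizer (requires the package and vocab file):
#
#   import rwkv_tokenizer
#   tokenizer = rwkv_tokenizer.Tokenizer("./rwkv_vocab_v20230424.txt")
#   samples = ["Today is a beautiful day.", "今天是美好的一天。"]
#   for original, restored in find_roundtrip_failures(tokenizer, samples):
#       print(repr(original), "->", repr(restored))
```

An empty result means every sample survived the round trip; any returned pair is a candidate for a bug report.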