No project description provided
Project description
RWKV Tokenizer (WIP)
A fast RWKV Tokenizer using Rust. This is my very first program using Rust, so there are still many things to improve and to fix :-)
Installation
You need first to install cargo (rust compiler), if you don't have it already. Cargo will be not needed when I later publish the whl package in pypi
$ sudo apt install cargo
Then install the rwkv-tokenizer python module:
$ pip install rwkv-tokenizer
Usage
>>> import rwkv_tokenizer
>>> tokenizer = rwkv_tokenizer.Tokenizer("./rwkv_vocab_v20230424.txt")
>>> tokenizer.encode("Today is a beautiful day. 今天是美好的一天。")
[33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]
>>> tokenizer.decode([33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080])
'Today is a beautiful day. 今天是美好的一天。'
Bugs
There are still bugs where some characters are not encoded correctly. The bug have been fixed in the version 0.3.0.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
rwkv_tokenizer-0.3.1.tar.gz
(389.4 kB
view hashes)
Built Distribution
Close
Hashes for rwkv_tokenizer-0.3.1-cp310-cp310-macosx_10_12_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46d6ae7293f7a8efbae6484204743837fbcf2a1cfc640d5a7dcbf1766717146f |
|
MD5 | 3020e8fd06b22149e09f5db80c028766 |
|
BLAKE2b-256 | aff2f836fa54b4427295983c6797776ce0bc75af3c8fbd54f10eacbb957634d8 |