RWKV Tokenizer (WIP)
A fast RWKV tokenizer written in Rust. This is my very first Rust program, so there are still many things to improve and fix :-)
Installation
First install cargo (Rust's build tool and package manager), if you don't have it already. Cargo will no longer be needed once I publish the wheel (whl) package on PyPI.
$ sudo apt install cargo
Then install the rwkv-tokenizer python module:
$ pip install rwkv-tokenizer
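To confirm which version pip actually installed, a small check along these lines may help. It is only a sketch: `installed_version` and `at_least` are hypothetical helper names, the version parsing assumes plain dotted numbers such as "0.3.2", and the 0.3.0 threshold is the release named in the Bugs section as containing the encoding fix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def at_least(found, minimum):
    """Compare two versions, assuming plain dotted numbers such as "0.3.2"."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    return parse(found) >= parse(minimum)


found = installed_version("rwkv-tokenizer")
if found is None:
    print("rwkv-tokenizer is not installed")
elif at_least(found, "0.3.0"):
    print("rwkv-tokenizer", found, "includes the encoding fix")
else:
    print("rwkv-tokenizer", found, "is older than 0.3.0, consider upgrading")
```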
Usage
>>> import rwkv_tokenizer
>>> tokenizer = rwkv_tokenizer.Tokenizer("./rwkv_vocab_v20230424.txt")
>>> tokenizer.encode("Today is a beautiful day. 今天是美好的一天。")
[33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080]
>>> tokenizer.decode([33520, 4600, 332, 59219, 21509, 47, 33, 10381, 11639, 13091, 15597, 11685, 14734, 10250, 11639, 10080])
'Today is a beautiful day. 今天是美好的一天。'
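Because `decode` reverses `encode` in the session above, a round-trip check is a quick way to spot characters that are not handled correctly. The sketch below is an illustration, not part of the package: `round_trip_failures` is a hypothetical helper, and `ByteTokenizer` is a stand-in (one token per UTF-8 byte, not the RWKV vocabulary) so the example runs without the vocabulary file.

```python
def round_trip_failures(tokenizer, samples):
    """Return the samples for which decode(encode(text)) differs from text."""
    return [
        text
        for text in samples
        if tokenizer.decode(tokenizer.encode(text)) != text
    ]


class ByteTokenizer:
    """Stand-in tokenizer (NOT the RWKV vocabulary): one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


samples = [
    "Today is a beautiful day. 今天是美好的一天。",
    "naïve café",
    "",
]

# Any object exposing encode(str) and decode(list of int) can be passed here.
failures = round_trip_failures(ByteTokenizer(), samples)
print("round-trip failures:", failures)
```

To run the same check against the real tokenizer, pass `rwkv_tokenizer.Tokenizer("./rwkv_vocab_v20230424.txt")` in place of `ByteTokenizer()`.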
Bugs
Earlier versions had a bug where some characters were not encoded correctly. This bug has been fixed in version 0.3.0.