Fast BPE tokenizer/trainer with a Rust core and Python bindings
unitoken
unitoken is a fast BPE tokenizer/trainer with a Rust core and optional Python bindings.
Install
Rust:
cargo add unitoken
Python (wheels via PyPI; the distribution is named uni-tokenizer, but the module is imported as unitoken):
pip install uni-tokenizer
Quickstart (Python)
from unitoken import BpeTrainer, BpeEncoder

# Train: the list holds the special tokens; the first one is treated as EOT.
trainer = BpeTrainer(["<|endoftext|>"], None)
trainer.add_words({"hello": 10, "world": 7})
trainer.train(vocab_size=256)
trainer.save("demo")

# Encode: load the tokenizer saved above and encode a single word.
enc = BpeEncoder.load("demo")
ids = enc.encode_word("hello")
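For readers new to BPE, the idea behind training can be sketched in a few lines of pure Python. This is an illustration only, not unitoken's implementation: the names `merge_pair` and `train_bpe` are hypothetical, the sketch works on characters rather than bytes, it takes a number of merges instead of a target vocabulary size, and the tie-breaking rule is an arbitrary choice made for determinism.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a word -> count mapping (illustrative only)."""
    # Start with every word split into single-character symbols.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically largest pair
        # (an assumption made here only to keep the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {merge_pair(symbols, best): count for symbols, count in corpus.items()}
    return merges
```

Calling `train_bpe({"hello": 10, "world": 7}, num_merges=3)` returns the learned merges in order. A production trainer such as the Rust core keeps pair counts updated incrementally instead of recounting on every step, which is where most of the speed comes from in practice.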
Building from source
This project uses maturin to build the Python extension module. To build it and install it into the active virtual environment, run:
maturin develop
Source Distribution
uni_tokenizer-0.1.0.tar.gz (58.8 kB)
Built Distribution
uni_tokenizer-0.1.0-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB)
File details
Details for the file uni_tokenizer-0.1.0.tar.gz.
File metadata
- Download URL: uni_tokenizer-0.1.0.tar.gz
- Upload date:
- Size: 58.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6dc611fb8c6fa22ba43b2ba06b9699757228388f8dcaa647ad538e03cae016d |
| MD5 | 155a0d2a16099901f2624b5fe4bd0ba2 |
| BLAKE2b-256 | 05699f5fac391cb7075efc42a2d8aed4cfc3402bdcc6020063814f982bb504ba |
File details
Details for the file uni_tokenizer-0.1.0-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: uni_tokenizer-0.1.0-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94d5b2e2e94a1d67c93ab199484e109ffa988cb8e09702910597dfffbed2bad1 |
| MD5 | 93ef3c3364ac7b82d0201f4d2a8016dd |
| BLAKE2b-256 | 831b6468dfee5d1bdc47cac6cb835e6f1aaf8142905e826088061a6357adaef0 |
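The published digests can be checked against a downloaded file before installing it. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of` and `matches_published_digest` are hypothetical, and the local file path is whatever location the file was downloaded to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large wheels do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the sdist has been downloaded to the current directory):
# matches_published_digest(
#     "uni_tokenizer-0.1.0.tar.gz",
#     "d6dc611fb8c6fa22ba43b2ba06b9699757228388f8dcaa647ad538e03cae016d",
# )
```

A `False` result means the file differs from the one the digest was published for, so it should be downloaded again rather than installed.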