Minimal BPE tokenizer in Zig
Tokenizer
BPE tokenizer implemented entirely in Zig.
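For readers new to byte-pair encoding, here is a toy sketch of the merge loop that BPE encoding is built on. It is written in Python for brevity and is not this project's Zig implementation; the function name `bpe_encode` and the hand-written merge list are assumptions made up for illustration. Real tokenizers also add pre-tokenization, byte-level fallback, and a vocabulary that maps symbols to IDs.

```python
def bpe_encode(word, merges):
    """Apply ranked BPE merges to one word and return the resulting symbols.

    `merges` is an ordered list of symbol pairs; earlier pairs take priority.
    (Hypothetical helper for illustration, not part of this project.)
    """
    # Lower rank means the pair was learned earlier and is merged first.
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is mergeable
        merged = []
        i = 0
        while i < len(symbols):
            # Merge every left-to-right occurrence of the best pair.
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table, invented for this example.
merges = [("l", "l"), ("h", "e"), ("he", "ll"), ("hell", "o")]
print(bpe_encode("hello", merges))  # ['hello']
print(bpe_encode("help", merges))   # ['he', 'l', 'p']
```

A production encoder would then look up each resulting symbol in the model's vocabulary to obtain token IDs.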
Example integration with LLMs at nnx-lm.
Requirement
zig v0.13.0
Install
git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast
Usage
zig-out/bin/tokenizer_exe [--model MODEL_NAME] COMMAND INPUT
zig build run -- [--model MODEL_NAME] COMMAND INPUT
zig build run -- --encode "hello world"
zig build run -- --decode "{14990, 1879}"
zig build run -- --model "phi-4-4bit" --encode "hello world"
zig build run -- --model "phi-4-4bit" --decode "15339 1917"
zig build run -- --repo "Qwen" --model "Qwen3-0.6B" --encode "안녕"
zig build run -- --repo "Qwen" --model "Qwen3-0.6B" --decode "126246 144370"
Python (optional)
Tokenizer is also pip-installable for use from Python:
pip install tokenizerz
python
Usage:
>>> import tokenizerz
>>> tokenizer = tokenizerz.Tokenizer()
Directory 'Qwen2.5-Coder-0.5B' created successfully.
DL% UL% Dled Uled Xfers Live Total Current Left Speed
100 -- 6866k 0 1 0 0:00:01 0:00:01 --:--:-- 4904k
Download successful.
>>> tokens = tokenizer.encode("Hello, world!")
>>> print(tokens)
[9707, 11, 1879, 0]
>>> tokenizer.decode(tokens)
'Hello, world!'
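A quick way to sanity-check any tokenizer exposing the `encode`/`decode` pair shown above is a round-trip test. The sketch below assumes only those two methods; `roundtrip_ok` and the stand-in `ByteTokenizer` class are hypothetical names introduced here so the example runs without downloading a model.

```python
def roundtrip_ok(tokenizer, text):
    """Return True if decoding the encoded text reproduces the input."""
    ids = tokenizer.encode(text)
    return tokenizer.decode(ids) == text


class ByteTokenizer:
    """Stand-in tokenizer (hypothetical): one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


print(roundtrip_ok(ByteTokenizer(), "Hello, world!"))
print(roundtrip_ok(ByteTokenizer(), "안녕"))
```

With the package installed, the same helper should accept the object created in the session above, for example `roundtrip_ok(tokenizerz.Tokenizer(), "Hello, world!")`.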
Shell:
bpe --encode "hello world"
Download files
Source Distribution
tokenizerz-0.0.3.tar.gz (16.7 kB)
File details
Details for the file tokenizerz-0.0.3.tar.gz.
File metadata
- Download URL: tokenizerz-0.0.3.tar.gz
- Upload date:
- Size: 16.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | be3709ee9d8db6309aea6b70da140c4b60fe36615c9de412c3300a39f5a494cf |
| MD5 | 6e8e6ff3a897f0b02998040fe084830b |
| BLAKE2b-256 | 394f59f4634e191825301e000e639bb3de20c0658757ffbfa7169eec4a9b09ed |