Minimal BPE tokenizer in Zig

Tokenizer

BPE tokenizer implemented entirely in Zig.

For an example integration with LLMs, see nnx-lm.

Requirements

Zig v0.13.0

Install

git clone https://github.com/jaco-bro/tokenizer
cd tokenizer
zig build exe --release=fast

Usage

  • zig-out/bin/tokenizer_exe [--model MODEL_NAME] COMMAND INPUT
  • zig build run -- [--model MODEL_NAME] COMMAND INPUT

Examples:

zig build run -- --encode "hello world"
zig build run -- --decode "{14990, 1879}"
zig build run -- --model "phi-4-4bit" --encode "hello world"
zig build run -- --model "phi-4-4bit" --decode "15339 1917"
zig build run -- --repo "Qwen" --model "Qwen3-0.6B" --encode "안녕"
zig build run -- --repo "Qwen" --model "Qwen3-0.6B" --decode "126246 144370"
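
The decode examples above pass token IDs in two textual forms: a brace-and-comma list ("{14990, 1879}") and a space-separated list ("15339 1917"). How the CLI parses these internally is not shown here. For a caller that builds or inspects such strings from a script, the following is a minimal sketch of a normalizing helper. The function name parse_token_ids is hypothetical and is not part of the tokenizer.

```python
import re
from typing import List


def parse_token_ids(text: str) -> List[int]:
    """Extract token IDs from a string such as "{14990, 1879}" or "15339 1917".

    Hypothetical helper for illustration only: it collects every run of
    decimal digits in order and ignores braces, commas, and whitespace.
    """
    return [int(match) for match in re.findall(r"\d+", text)]


def format_token_ids(ids: List[int]) -> str:
    """Render IDs in the space-separated form used in the examples above."""
    return " ".join(str(i) for i in ids)
```

Both input forms from the examples normalize to a plain list of integers, which can then be rendered back with format_token_ids before being passed to --decode.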

Python (optional)

Tokenizer is also pip-installable for use from Python:

pip install tokenizerz
python

Usage:

>>> import tokenizerz
>>> tokenizer = tokenizerz.Tokenizer()
Directory 'Qwen2.5-Coder-0.5B' created successfully.
DL% UL%  Dled  Uled  Xfers  Live Total     Current  Left    Speed
100 --  6866k     0     1     0   0:00:01  0:00:01 --:--:-- 4904k
Download successful.
>>> tokens = tokenizer.encode("Hello, world!")
>>> print(tokens)
[9707, 11, 1879, 0]
>>> tokenizer.decode(tokens)
'Hello, world!'
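
To illustrate what encode and decode do conceptually, here is a toy byte-level BPE sketch in plain Python. It is not the library's implementation and does not reproduce the token IDs shown above: the merge table is learned from a tiny sample string, and the names train_merges, merge, encode, and decode are assumptions made for this example. Real models ship a fixed vocabulary and usually apply a pre-tokenization step that this sketch omits.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def merge(ids: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out: List[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_merges(text: str, num_merges: int) -> Dict[Pair, int]:
    """Learn up to `num_merges` merges; IDs 0..255 are raw bytes."""
    ids = list(text.encode("utf-8"))
    merges: Dict[Pair, int] = {}
    for step in range(num_merges):
        counts: Dict[Pair, int] = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges


def encode(text: str, merges: Dict[Pair, int]) -> List[int]:
    """Apply learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # The earliest-learned merge has the smallest new ID.
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break
        ids = merge(ids, best, merges[best])
    return ids


def decode(ids: List[int], merges: Dict[Pair, int]) -> str:
    """Expand each ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # insertion order = learned order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    table = train_merges("hello hello world", 5)
    tokens = encode("hello world", table)
    print(tokens)
    print(decode(tokens, table))
```

Because the base vocabulary covers all 256 byte values, any UTF-8 input round-trips through encode and decode, even text the merge table never saw during training.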

Shell:

bpe --encode "hello world"
