unitoken

unitoken is a fast byte-pair encoding (BPE) tokenizer and trainer with a Rust core and optional Python bindings.

Install

Rust:

cargo add unitoken

Python (wheels via PyPI; the distribution is named uni-tokenizer, while the import name is unitoken):

pip install uni-tokenizer

Quickstart (Python)

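from unitoken import BpeTrainer, BpeEncoder

# Special tokens come first; the first one is treated as end-of-text (EOT).
trainer = BpeTrainer(["<|endoftext|>"], None)
# Register words together with their frequency counts.
trainer.add_words({"hello": 10, "world": 7})
trainer.train(vocab_size=256)
# Save the trained tokenizer under the name "demo".
trainer.save("demo")

# Load the saved tokenizer and encode a single word.
enc = BpeEncoder.load("demo")
ids = enc.encode_word("hello")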
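
For readers new to BPE, the following is a minimal pure-Python sketch of what training on word counts does conceptually: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. It is an illustration only and not unitoken's implementation. The function name train_bpe, the num_merges parameter, the byte-level starting alphabet, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def train_bpe(word_counts, num_merges):
    """Toy byte-level BPE trainer (illustrative, not unitoken's algorithm)."""
    # Start with every word split into single-byte symbols.
    words = {
        tuple(bytes([b]) for b in word.encode("utf-8")): count
        for word, count in word_counts.items()
    }
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Assumed tie-break: highest count, then lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + count
        words = new_words
    return merges, words


merges, words = train_bpe({"hello": 10, "world": 7}, num_merges=3)
print(merges)  # [(b'e', b'l'), (b'el', b'l'), (b'ell', b'o')]
```

Because "hello" is more frequent than "world", all three merges come from "hello". A production trainer such as unitoken's Rust core would typically avoid recounting every pair on each iteration, which is where most of the speed difference comes from.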

Building from source

The Python extension module is built with maturin. To compile the Rust core and install the package into the active virtual environment, run:

maturin develop
