unitoken

unitoken is a fast BPE tokenizer/trainer with a Rust core and optional Python bindings.

Install

Rust:

cargo add unitoken

Python (prebuilt wheels on PyPI; the package is named uni-tokenizer and is imported as uni_tokenizer):

pip install uni-tokenizer

Quickstart (Python)

from uni_tokenizer import BpeTrainer, BpeEncoder

# Train on word counts; the first special token is treated as EOT.
trainer = BpeTrainer(["<|endoftext|>"])
trainer.add_words({"hello": 10, "world": 7})  # word -> frequency
trainer.train(vocab_size=256)
trainer.save("demo")

# Load the saved model and encode a single word.
enc = BpeEncoder.load("demo")
ids = enc.encode_word("hello")
print(ids)
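
For readers new to BPE, the following is a minimal pure-Python sketch of what training on word counts amounts to conceptually. It is not unitoken's implementation: the Rust core may operate on bytes, handle special tokens, and break ties differently. The names train_bpe, pair_counts, merge_pair and apply_merges are hypothetical and exist only for this illustration, and the tie-breaking rule is an arbitrary choice made so that the result is deterministic.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, weighted by each word's frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a word -> frequency mapping."""
    # Assumption: start from characters; a byte-level trainer would start from bytes.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically larger pair.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        new_words = {}
        for symbols, freq in words.items():
            key = merge_pair(symbols, best)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
    return merges


def apply_merges(word, merges):
    """Encode one word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = train_bpe({"hello": 10, "world": 7}, num_merges=4)
print(merges)
print(apply_merges("hello", merges))
```

With the same counts as the quickstart, the four merges in this sketch all come from "hello", because its pairs occur 10 times against 7 for those in "world". A real encoder then maps each resulting symbol to an integer id from the saved vocabulary.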

Building from source

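This project uses maturin to build the Python extension module. With a Rust toolchain installed and a virtual environment active, the following command compiles the extension and installs it into that environment: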

maturin develop
