True UTF-8 tokenizer for byte-level models

Project description

Back to Bytes: Revisiting Tokenization Through `UTF-8`

The full writeup can be found in the paper.

This module provides a true byte-level tokenizer that encodes text as a sequence of UTF-8 bytes (0-255). Unlike ByT5Tokenizer, for example, UTF8Tokenizer is implemented from scratch and is much more efficient.

Other "byte-level" tokenizers usually add various "special tokens" (e.g., <pad>, <unk>), which makes the encoding and decoding logic more complex and pushes token ids above 255.

Instead, we rely on the C0 control characters (0-31) as special tokens. These rarely appear in normal text (tab and newline are the common exceptions), so every token id stays within 0-255.
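
To make the idea concrete, here is a minimal pure-Python sketch of byte-level encoding with control characters as special tokens. It does not use the library, and the specific assignments (NUL for padding, STX and ETX for sequence boundaries) are assumptions chosen for illustration; the actual UTF8Tokenizer may use different control codes.

```python
# Hypothetical special-token assignment (not taken from the library).
PAD, BOS, EOS = 0x00, 0x02, 0x03  # NUL, STX (start of text), ETX (end of text)


def encode(text: str) -> list[int]:
    """Wrap the UTF-8 bytes of `text` in BOS/EOS; every id stays in 0-255."""
    return [BOS, *text.encode("utf-8"), EOS]


def decode(ids: list[int]) -> str:
    """Drop the special control bytes and decode the rest as UTF-8."""
    return bytes(i for i in ids if i not in (PAD, BOS, EOS)).decode("utf-8")


def pad_batch(batch: list[list[int]]) -> list[list[int]]:
    """Right-pad every sequence with PAD to the length of the longest one."""
    width = max(len(ids) for ids in batch)
    return [ids + [PAD] * (width - len(ids)) for ids in batch]


batch = pad_batch([encode("word"), encode("héllo")])
print(batch[0])  # [2, 119, 111, 114, 100, 3, 0, 0]
print(decode(batch[1]))  # héllo
```

Because the special tokens are themselves valid byte values, the vocabulary never needs more than 256 entries, and decoding is just a filter followed by a UTF-8 decode.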

Usage

Tokenization:

from utf8_tokenizer.tokenizer import UTF8Tokenizer

tokenizer = UTF8Tokenizer()

texts = ["word", "or multiple"]
print(tokenizer(texts))

# Very fast version
print(tokenizer.torch(texts))

Bit-biased byte embeddings:

from transformers import AutoModelForCausalLM

# Load example model
model = AutoModelForCausalLM.from_pretrained("sbintuitions/tiny-lm")
model.resize_token_embeddings(256)

from utf8_tokenizer.embeddings import patch_embedding_layers, join_embedding_layers

patch_embedding_layers(model) # Apply bit-bias for training

#
# Train your model...
#

join_embedding_layers(model) # Fold to a single embedding layer for inference
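
The README does not spell out how the bit-bias is parameterized, so the following is only a toy sketch under one assumption: that the bias is an additive, linear projection of the byte's 8 bits. Under that assumption, it shows why the extra layer can be folded into a plain 256-row table after training. All names here (`table`, `bit_proj`, `embed_patched`, `folded`) are hypothetical and are not part of the library.

```python
import random

DIM = 4  # toy embedding width
rng = random.Random(0)

# Base table: one vector per byte value.
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(256)]
# Bit projection: one vector per bit position (assumed parameterization).
bit_proj = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(8)]


def bits(byte: int) -> list[int]:
    """The 8 bits of `byte`, least significant first."""
    return [(byte >> k) & 1 for k in range(8)]


def bit_bias(byte: int) -> list[float]:
    """Sum of the projection vectors of the bits that are set in `byte`."""
    return [
        sum(b * bit_proj[k][d] for k, b in enumerate(bits(byte)))
        for d in range(DIM)
    ]


def embed_patched(byte: int) -> list[float]:
    """Training-time lookup: base embedding plus the bit bias."""
    return [e + b for e, b in zip(table[byte], bit_bias(byte))]


# Folding: the bias depends only on the byte value, so it can be
# precomputed once per byte, leaving an ordinary 256-row table.
folded = [embed_patched(byte) for byte in range(256)]

print(folded[65] == embed_patched(65))  # True
```

Since there are only 256 possible inputs, the folded table reproduces the patched lookup exactly, which is why inference needs no extra computation in this sketch.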

Benchmark

Tokenization Speed

python experiments/benchmark.py

On a MacBook Pro with an Apple M4 Pro chip, simply converting texts of 6 words in different languages to bytes, without wrapping them in tensors, creating attention masks, or padding, runs at 127.4k/sec.

The ByT5 tokenizer runs at 6.2k/sec. Our new tokenizer, called through the __call__ path, reaches 10.5k/sec, which is about 1.7x faster.

Our optimized zero-copy version runs at 86.7k/sec. The loss of performance compared to the raw byte conversion comes from assembling the input ids into a properly padded tensor. This is a 14x speedup over the ByT5 tokenizer.
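
The ratios between these figures can be checked with a few lines of arithmetic. The variable names are only labels for the numbers quoted in this section, in thousands per second.

```python
# Throughput figures quoted in this section, in thousands per second.
raw_bytes = 127.4   # plain conversion to bytes, no tensors or padding
byt5 = 6.2          # ByT5 tokenizer
call_path = 10.5    # UTF8Tokenizer through __call__
zero_copy = 86.7    # optimized zero-copy version

speedup_call = call_path / byt5
speedup_zero_copy = zero_copy / byt5
fraction_of_raw = zero_copy / raw_bytes

print(f"__call__ vs ByT5:   {speedup_call:.1f}x")       # 1.7x
print(f"zero-copy vs ByT5:  {speedup_zero_copy:.1f}x")  # 14.0x
print(f"zero-copy vs raw:   {fraction_of_raw:.0%}")     # 68%
```

So the zero-copy path keeps roughly two thirds of the raw conversion throughput while still producing padded tensors.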

Bit-Biased Byte Embedding

We train a small language model with and without bit-bias.

Our results show that bit-bias improves both loss and accuracy, while increasing training time by about 1%. We hope the bit-level embeddings module can be optimized further to minimize this training overhead.

Cite

If you use this code in your research, please consider citing the work:

@misc{moryossef2025utf8,
  title={Back to Bytes: Revisiting Tokenization Through {UTF-8}},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/utf8-tokenizer}},
  year={2025}
}

Project details

Release history

0.8.2 (Feb 17, 2026)
0.8.1 (Feb 9, 2026)
0.8.0 (Feb 9, 2026)
0.7.1 (Jan 31, 2026)
0.7.0 (Jan 30, 2026)
0.6.4 (Jan 25, 2026)
0.6.3 (Jan 12, 2026)
0.6.2 (Jan 12, 2026)
0.6.1 (Jan 12, 2026)
0.6.0 (Jan 11, 2026)
0.5.0 (Jan 9, 2026)
0.4.0 (Dec 27, 2025)
0.3.0 (Dec 5, 2025)
0.2.0 (Nov 11, 2025)
0.1.2 (Oct 10, 2025)
0.1.1 (Oct 1, 2025)
0.1.0 (Sep 16, 2025), this version

Download files

Source Distribution

utf8_tokenizer-0.1.0.tar.gz (12.3 kB)

Uploaded Sep 16, 2025 Source

Built Distribution

utf8_tokenizer-0.1.0-py3-none-any.whl (8.5 kB)

Uploaded Sep 16, 2025 Python 3

File details

Details for the file utf8_tokenizer-0.1.0.tar.gz.

File metadata

Download URL: utf8_tokenizer-0.1.0.tar.gz
Upload date: Sep 16, 2025
Size: 12.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for utf8_tokenizer-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fc489eab831a06f6dbe1d553e4a35cf5304fc1edcf8abe78168620e7f0f18e94`
MD5	`2f9ba72f6051139d74116bc93ba66e90`
BLAKE2b-256	`671d7ef8234a69094e53f9102de88764d5530f1ce2672fbc92f24f648da249ba`

Provenance

The following attestation bundles were made for utf8_tokenizer-0.1.0.tar.gz:

Publisher: release.yaml on sign/utf8-tokenizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: utf8_tokenizer-0.1.0.tar.gz
- Subject digest: fc489eab831a06f6dbe1d553e4a35cf5304fc1edcf8abe78168620e7f0f18e94
- Sigstore transparency entry: 522983104
- Sigstore integration time: Sep 16, 2025
Source repository:
- Permalink: sign/utf8-tokenizer@7f3b664b8372af352caceb3834ba7e433f4cf52d
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/sign
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@7f3b664b8372af352caceb3834ba7e433f4cf52d
- Trigger Event: release

File details

Details for the file utf8_tokenizer-0.1.0-py3-none-any.whl.

File metadata

Download URL: utf8_tokenizer-0.1.0-py3-none-any.whl
Upload date: Sep 16, 2025
Size: 8.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for utf8_tokenizer-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a52ac70c3bff83b73eacab4d2f9575983d0b281223c2401abb4efec1cc3b814b`
MD5	`f4646db8223d6bde3fefbea4924d6cbb`
BLAKE2b-256	`551030f89e5389bd006ec77f5b18d25fb1f8b44962a333ec4e8fe70816153b07`

Provenance

The following attestation bundles were made for utf8_tokenizer-0.1.0-py3-none-any.whl:

Publisher: release.yaml on sign/utf8-tokenizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: utf8_tokenizer-0.1.0-py3-none-any.whl
- Subject digest: a52ac70c3bff83b73eacab4d2f9575983d0b281223c2401abb4efec1cc3b814b
- Sigstore transparency entry: 522983138
- Sigstore integration time: Sep 16, 2025
Source repository:
- Permalink: sign/utf8-tokenizer@7f3b664b8372af352caceb3834ba7e433f4cf52d
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/sign
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@7f3b664b8372af352caceb3834ba7e433f4cf52d
- Trigger Event: release
