True UTF-8 tokenizer for byte level models
Back to Bytes: Revisiting Tokenization Through UTF-8
The full writeup is available in our paper (https://arxiv.org/abs/2510.16987).
This module provides a true byte-level tokenizer that encodes text into its sequence of UTF-8 bytes, so every token ID is in the range 0-255.
Unlike ByT5Tokenizer, for example, UTF8Tokenizer is implemented from scratch and is much more efficient.
Other "byte-level" tokenizers usually add various "special tokens" (e.g., <pad>, <unk>),
which complicates the encoding and decoding logic and pushes token IDs above 255.
Instead, we use the C0 control characters (0-31) as special tokens; most of them do not occur in normal text.
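The idea can be sketched in a few lines of plain Python. This is an illustration, not the package's implementation: the token IDs of a text are simply its UTF-8 bytes, and control bytes mark the sequence boundaries. The choice of STX (2) and ETX (3) as start and end markers is an assumption made for this example; the tokenizer's actual control-character assignments may differ.

```python
# Hypothetical marker choice, for illustration only.
BOS = 0x02  # STX, "start of text"
EOS = 0x03  # ETX, "end of text"


def encode(text: str) -> list[int]:
    """Token IDs are the UTF-8 bytes of the text, wrapped in two control bytes."""
    return [BOS, *text.encode("utf-8"), EOS]


def decode(ids: list[int]) -> str:
    """Drop the two marker bytes and decode the remaining bytes as UTF-8."""
    return bytes(i for i in ids if i not in (BOS, EOS)).decode("utf-8")


ids = encode("héllo")
print(ids)          # [2, 104, 195, 169, 108, 108, 111, 3]
print(decode(ids))  # héllo
```

Because the markers are themselves byte values, the vocabulary never grows beyond 256 entries, and "é" shows how a single character can span two token IDs (195, 169).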
Usage
pip install utf8-tokenizer
Tokenization:
from utf8_tokenizer.tokenizer import UTF8Tokenizer
tokenizer = UTF8Tokenizer()
texts = ["word", "or multiple"]
print(tokenizer(texts))
Chat Template:
from utf8_tokenizer.tokenizer import UTF8Tokenizer
from utf8_tokenizer.control import visualize_control_tokens
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hey, what's 1+1?"},
{"role": "assistant", "content": "1+1 is 2."},
]
tokenizer = UTF8Tokenizer()
text = tokenizer.apply_chat_template(messages, tokenize=False)
# Visualize the text with special tokens
print(visualize_control_tokens(text))
Bit-biased byte embeddings:
from transformers import AutoModelForCausalLM
from utf8_tokenizer.embeddings import patch_embedding_layers, join_embedding_layers
# Load example model
model = AutoModelForCausalLM.from_pretrained("sbintuitions/tiny-lm")
model.resize_token_embeddings(256)
patch_embedding_layers(model) # Apply bit-bias for training
#
# Train your model...
#
join_embedding_layers(model) # Fold to a single embedding layer for inference
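To make the patch-then-fold workflow concrete, here is a toy sketch in plain Python. It rests on an assumption about the mechanism: during training, each byte's embedding is its table row plus a learned projection of the byte's 8 bits, and for inference that sum is precomputed into a single 256-row table. The names `table`, `bit_proj`, `embed_training`, and `embed_inference` are hypothetical, and the package's actual layers may be implemented differently; see the paper for the real formulation.

```python
import random

DIM = 4  # toy embedding width
rng = random.Random(0)

# Stand-ins for learned parameters: a 256 x DIM byte table and an 8 x DIM bit projection.
table = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
bit_proj = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(8)]


def byte_bits(b: int) -> list[int]:
    """The 8 bits of a byte, least significant bit first."""
    return [(b >> i) & 1 for i in range(8)]


def embed_training(b: int) -> list[float]:
    """Table lookup plus a bias computed from the byte's bit pattern."""
    bits = byte_bits(b)
    return [
        table[b][j] + sum(bits[i] * bit_proj[i][j] for i in range(8))
        for j in range(DIM)
    ]


# "Folding": precompute the sum once per byte, leaving a plain 256-row table.
folded = [embed_training(b) for b in range(256)]


def embed_inference(b: int) -> list[float]:
    """After folding, embedding a byte is a single lookup."""
    return folded[b]


print(byte_bits(5))                              # [1, 0, 1, 0, 0, 0, 0, 0]
print(embed_inference(65) == embed_training(65))  # True
```

Under this assumption, bytes that share bits (for example upper- and lower-case ASCII letters, which differ in one bit) share part of their embedding during training, while inference pays no extra cost.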
UTF-8 Validation during Generation:
from transformers import AutoModelForCausalLM
from utf8_tokenizer import UTF8Tokenizer, UTF8ValidationLogitsProcessor
# Load your byte-level model
model = AutoModelForCausalLM.from_pretrained("your-model")
tokenizer = UTF8Tokenizer()
# Create the UTF-8 validation processor
utf8_processor = UTF8ValidationLogitsProcessor()
# Generate text with UTF-8 validation
input_text = "Hello"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(
input_ids,
logits_processor=[utf8_processor], # Ensures valid UTF-8 sequences
max_new_tokens=100
)
# Decode the output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
The UTF8ValidationLogitsProcessor prevents byte-level models from generating malformed UTF-8 by masking invalid next bytes at each generation step. This addresses the issue discussed in Firestone et al. (2024), where byte-level tokenizers can generate ill-formed UTF-8.
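The masking rule itself can be sketched independently of any model. The helper below is not the package's code: the function names are hypothetical, and it assumes the bytes generated so far form a valid UTF-8 prefix. It returns the set of byte values that keep the sequence well-formed, following the well-formed byte ranges defined by the Unicode Standard.

```python
def expected_length(lead: int) -> int:
    """Total length of the UTF-8 character that starts with `lead`."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


def pending_suffix(data: bytes) -> bytes:
    """Trailing bytes of an unfinished character (assumes a valid prefix)."""
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a character starts here
            return data[-back:] if expected_length(byte) > back else b""
    return b""


def allowed_next_bytes(data: bytes) -> set[int]:
    """Byte values that may follow `data` without breaking UTF-8."""
    pending = pending_suffix(data)
    if not pending:
        # Character boundary: ASCII or a valid lead byte (0xC0, 0xC1, 0xF5+ are never valid).
        return set(range(0x00, 0x80)) | set(range(0xC2, 0xF5))
    if len(pending) == 1:
        # Some lead bytes restrict the second byte to exclude overlong forms,
        # surrogates, and code points above U+10FFFF.
        special = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
                   0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}
        low, high = special.get(pending[0], (0x80, 0xBF))
        return set(range(low, high + 1))
    # Third and fourth bytes are always plain continuation bytes.
    return set(range(0x80, 0xC0))


prefix = "€".encode("utf-8")[:2]  # b"\xe2\x82", one byte short of a full character
allowed = allowed_next_bytes(prefix)
print(min(allowed), max(allowed))  # 128 191
```

A logits processor built on such a rule would set the scores of all byte IDs outside the allowed set to negative infinity before sampling.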
Benchmark
Tokenization Speed
python experiments/benchmark.py
On a MacBook Pro with an Apple M4 Pro chip, simply converting 75-word texts in different languages to bytes, without wrapping them in tensors, creating attention masks, or padding, runs at 109.9k texts/sec.
Calling the ByT5 tokenizer runs at 0.4k texts/sec.
Calling our new tokenizer through the __call__ path runs at 0.5k texts/sec, which is slightly faster.
Our optimized zero-copy version runs at 66k texts/sec; the slowdown relative to the raw byte conversion comes from padding the input IDs into a properly padded tensor. This is a 164x speedup over the ByT5 tokenizer.
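For a rough feel of where the time goes, the following standard-library sketch times raw byte conversion against conversion plus padding. It is not `experiments/benchmark.py`: the helper names are hypothetical, `pad_id=0` is an assumption, the padding uses plain Python lists rather than tensors, and the absolute numbers will differ from those reported above.

```python
import timeit


def encode_raw(texts: list[str]) -> list[list[int]]:
    """Raw conversion: each text becomes the list of its UTF-8 byte values."""
    return [list(t.encode("utf-8")) for t in texts]


def encode_padded(texts: list[str], pad_id: int = 0):
    """Conversion plus right-padding to a rectangle, with an attention mask."""
    ids = encode_raw(texts)
    width = max(len(seq) for seq in ids)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in ids]
    return input_ids, attention_mask


def texts_per_second(fn, texts: list[str], repeats: int = 200) -> float:
    """Throughput of `fn` over `texts`, averaged over `repeats` calls."""
    seconds = timeit.timeit(lambda: fn(texts), number=repeats)
    return repeats * len(texts) / seconds


texts = ["word", "or multiple", "héllo wörld"] * 100
print(f"raw:    {texts_per_second(encode_raw, texts):,.0f} texts/sec")
print(f"padded: {texts_per_second(encode_padded, texts):,.0f} texts/sec")
```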
Bit-Biased Byte Embedding
We train a small language model with and without bit-bias.
The results show that bit-bias improves both loss and accuracy while increasing training time by about 1%. We hope that our bit-level embeddings module can be optimized further to minimize this training overhead.
Cite
If you use this code in your research, please consider citing the work:
@misc{moryossef2025utf8,
title = {Back to Bytes: Revisiting Tokenization Through {UTF-8}},
author = {Amit Moryossef and Clara Meister and Pavel Stepachev and Desmond Elliott},
howpublished = {\url{https://github.com/sign/utf8-tokenizer}},
eprint = {2510.16987},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2510.16987},
year = {2025}
}