Benchmark tokenizers across 84 languages, 17 code languages, math formulas, and edge cases
TokenizerBench
This dataset is designed to evaluate a tokenizer's performance across the following categories before you use it for model pre-training or fine-tuning:
- 🌍 Human languages (multilingual + scripts)
- 💻 Programming languages (syntax-heavy)
- 🧮 Math & science expressions (symbols, unicode, formulas)
🎯 Goal
This dataset helps evaluate:
- Multilingual tokenization quality
- Code token handling
- Mathematical symbol parsing
- Robustness to noisy and mixed inputs
🧩 How to Use the Dataset
The dataset is organized into modular Python files:
data/
├── human_languages.py
├── programming_languages.py
├── scientific_formulas.py
└── edge_cases.py
Each file contains structured dictionaries that can be directly imported and used for tokenizer evaluation.
📥 1. Import the Dataset
from tokenizerbench.data.human_languages import human_languages
from tokenizerbench.data.programming_languages import programming_languages
from tokenizerbench.data.scientific_formulas import scientific_formulas
🔄 2. Combine All Data (Optional)
dataset = {
    "human_languages": human_languages,
    "programming_languages": programming_languages,
    "scientific_formulas": scientific_formulas,
}
🔍 3. Run Tokenizer Evaluation
Example using any tokenizer (HuggingFace, TikToken, SentencePiece, etc.):
def evaluate_tokenizer(tokenizer, dataset):
    """Return avg/max/min token counts per category and subcategory."""
    results = {}
    for category, data in dataset.items():
        results[category] = {}
        for subcategory, samples in data.items():
            # One token count per sample text.
            token_counts = [len(tokenizer.encode(text)) for text in samples]
            if not token_counts:
                continue  # skip empty subcategories to avoid dividing by zero
            results[category][subcategory] = {
                "avg_tokens": sum(token_counts) / len(token_counts),
                "max_tokens": max(token_counts),
                "min_tokens": min(token_counts),
            }
    return results
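A minimal usage sketch for the function above. `ByteTokenizer` and `toy_dataset` are hypothetical stand-ins, not part of TokenizerBench, so the snippet runs without third-party packages. The nesting (category, then subcategory, then a list of strings) is assumed from the loop structure of `evaluate_tokenizer`, which is repeated here in compact form only so the snippet is self-contained.

```python
from statistics import mean


class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def evaluate_tokenizer(tokenizer, dataset):
    """Compact copy of the statistics function, so this snippet runs alone."""
    results = {}
    for category, data in dataset.items():
        results[category] = {}
        for subcategory, samples in data.items():
            counts = [len(tokenizer.encode(text)) for text in samples]
            results[category][subcategory] = {
                "avg_tokens": mean(counts),
                "max_tokens": max(counts),
                "min_tokens": min(counts),
            }
    return results


# Assumed layout: category -> subcategory -> list of sample strings.
toy_dataset = {
    "human_languages": {
        "english": ["Hello world", "Good morning"],
        "hindi": ["नमस्ते"],
    },
}

results = evaluate_tokenizer(ByteTokenizer(), toy_dataset)
for category, subcategories in results.items():
    for name, stats in subcategories.items():
        print(category, name, stats)
```

Any object that exposes `encode(text)` and returns a sequence should be usable in place of `ByteTokenizer`.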
📊 4. Evaluate Compression Efficiency
def compression_ratio(tokenizer, text):
    # Tokens per character; lower means fewer tokens for the same text.
    tokens = tokenizer.encode(text)
    return len(tokens) / len(text)
👉 Run this across:
- Different languages
- Code snippets
- Math expressions
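Tokens per character can be hard to compare across scripts, because characters differ in UTF-8 width. The sketch below reports both tokens per character and tokens per UTF-8 byte. The per-byte metric, the `compression_stats` helper, and `WhitespaceTokenizer` are illustrative additions, not part of TokenizerBench.

```python
def compression_stats(tokenizer, text):
    """Return tokens per character and tokens per UTF-8 byte for one text."""
    if not text:
        raise ValueError("text must be non-empty")
    n_tokens = len(tokenizer.encode(text))
    return {
        "tokens_per_char": n_tokens / len(text),
        "tokens_per_byte": n_tokens / len(text.encode("utf-8")),
    }


class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()


samples = {
    "english": "the quick brown fox",
    "code": "for i in range(10): print(i)",
    "math": "∑ x² = α + β",
}

report = {
    name: compression_stats(WhitespaceTokenizer(), text)
    for name, text in samples.items()
}
for name, stats in report.items():
    print(f"{name}: {stats['tokens_per_char']:.3f} tok/char, "
          f"{stats['tokens_per_byte']:.3f} tok/byte")
```

For pure ASCII input the two ratios coincide; they diverge once the text contains multi-byte characters, as in the math sample.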
🌐 5. Test Unicode Robustness
def unicode_test(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    return text == decoded
Test on:
- Multilingual text
- Emojis
- Scientific symbols
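A strict equality check cannot tell a tokenizer that normalizes its input from one that loses information. The sketch below separates the two cases by comparing both strings under NFKC. The three status labels and both toy tokenizers are hypothetical, and NFKC is only one plausible normalization form to check against.

```python
import unicodedata


def roundtrip_status(tokenizer, text):
    """Classify an encode/decode round trip as 'exact', 'normalized', or 'lossy'."""
    decoded = tokenizer.decode(tokenizer.encode(text))
    if decoded == text:
        return "exact"
    # Equal under NFKC suggests the tokenizer only normalized the input.
    if unicodedata.normalize("NFKC", decoded) == unicodedata.normalize("NFKC", text):
        return "normalized"
    return "lossy"


class AsciiOnlyTokenizer:
    """Hypothetical lossy stand-in: every non-ASCII character becomes '?'."""

    def encode(self, text):
        return [ord(c) if ord(c) < 128 else ord("?") for c in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


class NfkcTokenizer:
    """Hypothetical stand-in that applies NFKC, then encodes code points."""

    def encode(self, text):
        return [ord(c) for c in unicodedata.normalize("NFKC", text)]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


# "\ufb01" is the "fi" ligature and "\u00b2" is superscript two.
samples = ["plain ascii", "Hello 世界 🚀", "\ufb01nance x\u00b2"]
for text in samples:
    print(repr(text),
          roundtrip_status(AsciiOnlyTokenizer(), text),
          roundtrip_status(NfkcTokenizer(), text))
```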
🧪 6. Long Sequence Testing
long_text = "AI_TOKEN_TEST " * 1000  # 14,000 chars
tokens = tokenizer.encode(long_text)
print("Token count:", len(tokens))
👉 Helps evaluate:
- Context handling
- Token explosion
- Memory efficiency
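One way to look for token explosion is to encode progressively longer inputs and watch whether tokens per character stays flat. The sketch below records size, token count, and wall-clock encode time per length. `CharTokenizer`, the `long_sequence_profile` helper, and the repeat counts are assumptions for illustration.

```python
import time


class CharTokenizer:
    """Hypothetical stand-in: one token per character."""

    def encode(self, text):
        return [ord(c) for c in text]


def long_sequence_profile(tokenizer, unit="AI_TOKEN_TEST ", repeats=(10, 100, 1000)):
    """Encode progressively longer inputs; record size, token count, and time."""
    rows = []
    for n in repeats:
        text = unit * n
        start = time.perf_counter()
        tokens = tokenizer.encode(text)
        elapsed = time.perf_counter() - start
        rows.append({
            "chars": len(text),
            "tokens": len(tokens),
            "tokens_per_char": len(tokens) / len(text),
            "seconds": elapsed,
        })
    return rows


profile = long_sequence_profile(CharTokenizer())
for row in profile:
    print(row)
```

A tokens-per-character value that grows with input length, or an encode time that grows much faster than linearly, would be worth investigating.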
⚠️ 7. Recommended Evaluation Strategy
Run comparisons across:
- Multiple tokenizers (BPE, SentencePiece, Unigram)
- Multiple categories:
  - Human languages
  - Code
  - Math & symbols
Track:
- Token count
- Compression ratio
- Decode fidelity
- Stability on long inputs
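The comparison described above can be sketched as a loop over tokenizers and categories that emits one row per pair. The two toy tokenizers and the sample strings are hypothetical placeholders; in practice you would pass real tokenizer objects and the TokenizerBench dictionaries. The sketch covers token count, compression ratio, and decode fidelity, not long-input stability.

```python
class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


class CharTokenizer:
    """Hypothetical stand-in: one token per Unicode code point."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


def compare(tokenizers, samples):
    """Return one row per (tokenizer, category) with the tracked metrics."""
    rows = []
    for tok_name, tok in tokenizers.items():
        for category, text in samples.items():
            tokens = tok.encode(text)
            rows.append({
                "tokenizer": tok_name,
                "category": category,
                "token_count": len(tokens),
                "compression_ratio": len(tokens) / len(text),
                "decode_ok": tok.decode(tokens) == text,
            })
    return rows


tokenizers = {"byte": ByteTokenizer(), "char": CharTokenizer()}
samples = {
    "human_languages": "Hello 世界",
    "code": "x = [i for i in range(3)]",
    "math": "α + β = γ",
}

rows = compare(tokenizers, samples)
for row in rows:
    print(f"{row['tokenizer']:<6}{row['category']:<18}{row['token_count']:>4}"
          f"{row['compression_ratio']:>7.2f}  {row['decode_ok']}")
```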
🧠 Pro Tip
For serious benchmarking, log results like:
{
    "tokenizer": "tiktoken",
    "language": "hindi",
    "avg_tokens": 18.2,
    "compression_ratio": 0.32,
    "unicode_safe": True
}
👉 This allows you to build:
- Leaderboards
- Tokenizer comparisons
- Performance dashboards
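One possible way to persist such records is JSON Lines, with one record per line, which is easy to append to and to sort later. The sketch below writes records to an in-memory buffer, reads them back, and ranks them. The tokenizer names and all numbers are made-up placeholders, not measured results, and `write_jsonl` and `leaderboard` are hypothetical helpers.

```python
import io
import json

# Placeholder values for illustration only; not real measurements.
records = [
    {"tokenizer": "tokenizer_a", "language": "hindi", "avg_tokens": 18.2,
     "compression_ratio": 0.32, "unicode_safe": True},
    {"tokenizer": "tokenizer_b", "language": "hindi", "avg_tokens": 12.5,
     "compression_ratio": 0.22, "unicode_safe": True},
]


def write_jsonl(records, stream):
    """Write one JSON object per line (Python True becomes JSON true)."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def leaderboard(records, metric="compression_ratio"):
    """Rank records by a metric, lowest value first."""
    return sorted(records, key=lambda r: r[metric])


buffer = io.StringIO()  # swap for open("results.jsonl", "a") to write a file
write_jsonl(records, buffer)

loaded = [json.loads(line) for line in buffer.getvalue().splitlines()]
for rank, rec in enumerate(leaderboard(loaded), start=1):
    print(rank, rec["tokenizer"], rec["compression_ratio"])
```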
📏 How to Measure Tokenizer Performance
1. Token Count
Measure how many tokens each input produces.
tokens = tokenizer.encode(text)
print(len(tokens))
👉 Lower token count (for same meaning) = better efficiency
2. Compression Ratio
compression_ratio = len(tokens) / len(text)
- Lower ratio → fewer tokens per character, i.e. a more compact encoding
- Indicates how efficiently the tokenizer represents the text
3. Unicode Handling
Test:
- Multilingual text
- Emojis
- Mathematical symbols
test = "Hello 世界 🚀 α β γ ∑"
tokens = tokenizer.encode(test)
decoded = tokenizer.decode(tokens)
Check:
- Is decoded text identical?
- Any corruption?
- Any token explosion?
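To see which characters drive token explosion, one rough approach is to encode each distinct character on its own and count the tokens. The sketch below does this with a hypothetical byte-level stand-in; the `threshold` of 3 is an arbitrary illustrative value. With a real subword tokenizer, the cost of a character in isolation can differ from its cost in context, so treat the result as a hint rather than an exact measurement.

```python
class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))


def per_char_cost(tokenizer, text):
    """Map each distinct non-space character to the tokens needed to encode it alone."""
    # dict.fromkeys keeps first-appearance order while removing duplicates.
    return {
        ch: len(tokenizer.encode(ch))
        for ch in dict.fromkeys(text)
        if not ch.isspace()
    }


def exploding_chars(tokenizer, text, threshold=3):
    """List characters whose isolated encoding needs at least `threshold` tokens."""
    costs = per_char_cost(tokenizer, text)
    return [ch for ch, cost in costs.items() if cost >= threshold]


test = "Hello 世界 🚀 α β γ ∑"
costs = per_char_cost(ByteTokenizer(), test)
flagged = exploding_chars(ByteTokenizer(), test)
print(costs)
print(flagged)
```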
4. Edge Case Robustness
Test:
- Long sequences (2K–10K chars)
- Mixed scripts
- Noisy text
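The edge cases listed above can be generated synthetically. The sketch below builds long, mixed-script, and noisy inputs; all helper names, the noise alphabet, and the 10% noise rate are assumptions for illustration and are not taken from `edge_cases.py`.

```python
import random


def make_long(unit, target_chars):
    """Repeat `unit` until the text reaches `target_chars`, then trim to that length."""
    repeats = target_chars // len(unit) + 1
    return (unit * repeats)[:target_chars]


def make_mixed(parts):
    """Join words from several scripts into one string."""
    return " ".join(parts)


def make_noisy(text, rate=0.1, seed=0):
    """Insert a random noise character after roughly `rate` of the characters."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    noise = "#@~^\u200b"       # punctuation plus a zero-width space
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(noise))
    return "".join(out)


edge_inputs = {
    "long_2k": make_long("tokenizer benchmark ", 2000),
    "long_10k": make_long("tokenizer benchmark ", 10000),
    "mixed_scripts": make_mixed(["Hello", "世界", "مرحبا", "नमस्ते", "Привет"]),
    "noisy": make_noisy("The quick brown fox jumps over the lazy dog"),
}

for name, text in edge_inputs.items():
    print(name, len(text))
```

Each generated string can then be passed through the token-count, compression-ratio, and round-trip checks described earlier.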
TODO
- Expand human_languages → 100 languages using ISO language list
- Keep same semantic structure across languages for consistency
- Add longer sequences (2K–10K chars) to test tokenizer limits
File details
Details for the file tokenizerbench-0.2.0.tar.gz.
File metadata
- Download URL: tokenizerbench-0.2.0.tar.gz
- Upload date:
- Size: 91.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6045140ae1b2e69564f03646da8d1388ca55dee08fafba46b23c4cef3fca6e0 |
| MD5 | 4b54c5bfa36c378aa89b044b96c4ee14 |
| BLAKE2b-256 | 5db4bac6e724abef2565d606c3707ae636bdb25c2975c027b1c43f0ec1de029e |
Provenance
The following attestation bundles were made for tokenizerbench-0.2.0.tar.gz:
Publisher: python-publish.yml on kitefishai/TokenizerBench
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenizerbench-0.2.0.tar.gz
- Subject digest: b6045140ae1b2e69564f03646da8d1388ca55dee08fafba46b23c4cef3fca6e0
- Sigstore transparency entry: 1226851320
- Sigstore integration time:
- Permalink: kitefishai/TokenizerBench@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Branch / Tag: refs/tags/v0.1
- Owner: https://github.com/kitefishai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Trigger Event: release
File details
Details for the file tokenizerbench-0.2.0-py3-none-any.whl.
File metadata
- Download URL: tokenizerbench-0.2.0-py3-none-any.whl
- Upload date:
- Size: 89.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8cd848101d12fa3eb6823e188679dadd2283bc0456d919fd5a14ddb6c06a38b |
| MD5 | dc8eb9d7647ca8fc0f37c90ea6606635 |
| BLAKE2b-256 | b11093dadea2068d92328aeb30d16f90b814b42c9def565a8a5ffce2761ef9c3 |
Provenance
The following attestation bundles were made for tokenizerbench-0.2.0-py3-none-any.whl:
Publisher: python-publish.yml on kitefishai/TokenizerBench
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenizerbench-0.2.0-py3-none-any.whl
- Subject digest: c8cd848101d12fa3eb6823e188679dadd2283bc0456d919fd5a14ddb6c06a38b
- Sigstore transparency entry: 1226851327
- Sigstore integration time:
- Permalink: kitefishai/TokenizerBench@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Branch / Tag: refs/tags/v0.1
- Owner: https://github.com/kitefishai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Trigger Event: release