Benchmark tokenizers across 84 languages, 17 code languages, math formulas, and edge cases

Project description

TokenizerBench

This dataset is designed to evaluate a tokenizer before you use it for model pre-training or fine-tuning. It covers:

🌍 Human languages (multilingual + scripts)
💻 Programming languages (syntax-heavy)
🧮 Math & science expressions (symbols, unicode, formulas)

🎯 Goal

This dataset helps evaluate:

Multilingual tokenization quality
Code token handling
Mathematical symbol parsing
Robustness to noisy and mixed inputs

🧩 How to Use the Dataset

The dataset is organized into modular Python files:

data/
├── human_languages.py
├── programming_languages.py
├── scientific_formulas.py
├── edge_cases.py

Each file contains structured dictionaries that can be directly imported and used for tokenizer evaluation.

📥 1. Import the Dataset

from tokenizerbench.data.human_languages import human_languages
from tokenizerbench.data.programming_languages import programming_languages
from tokenizerbench.data.scientific_formulas import scientific_formulas

🔄 2. Combine All Data (Optional)

dataset = {
    "human_languages": human_languages,
    "programming_languages": programming_languages,
    "scientific_formulas": scientific_formulas
}

🔍 3. Run Tokenizer Evaluation

Example using any tokenizer (HuggingFace, TikToken, SentencePiece, etc.):

def evaluate_tokenizer(tokenizer, dataset):
    """Return avg/max/min token counts per category and subcategory."""
    results = {}

    for category, data in dataset.items():
        results[category] = {}

        for subcategory, samples in data.items():
            # Token count for every sample text in this subcategory
            token_counts = [len(tokenizer.encode(text)) for text in samples]

            if not token_counts:
                continue  # skip empty subcategories to avoid dividing by zero

            results[category][subcategory] = {
                "avg_tokens": sum(token_counts) / len(token_counts),
                "max_tokens": max(token_counts),
                "min_tokens": min(token_counts)
            }

    return results
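As a quick way to try the same idea without installing a tokenizer library, here is a minimal sketch. The `CallableTokenizer` adapter, the `average_tokens` helper, and the toy dataset are all hypothetical stand-ins, not part of TokenizerBench. The toy dataset only mirrors the `{category: {subcategory: [texts]}}` shape that the loop above assumes.

```python
from statistics import mean


class CallableTokenizer:
    """Hypothetical adapter: gives any text -> tokens function an .encode method."""

    def __init__(self, fn):
        self._fn = fn

    def encode(self, text):
        return list(self._fn(text))


# Toy stand-in for the real data: {category: {subcategory: [texts]}} (assumed shape)
toy_dataset = {
    "human_languages": {
        "english": ["Hello world", "Tokenizers split text"],
        "hindi": ["नमस्ते दुनिया"],
    },
}

# Two deliberately naive tokenizers: whitespace words and raw UTF-8 bytes
whitespace = CallableTokenizer(str.split)
utf8_bytes = CallableTokenizer(lambda s: s.encode("utf-8"))


def average_tokens(tokenizer, dataset):
    """Average token count per (category, subcategory) pair."""
    return {
        (category, sub): mean(len(tokenizer.encode(t)) for t in samples)
        for category, data in dataset.items()
        for sub, samples in data.items()
    }


ws_avg = average_tokens(whitespace, toy_dataset)
byte_avg = average_tokens(utf8_bytes, toy_dataset)
```

Any real tokenizer object that exposes `encode(text)` could be passed in place of the toy ones.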

📊 4. Evaluate Compression Efficiency

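def compression_ratio(tokenizer, text):
    # Tokens per character; lower means fewer tokens for the same text
    if not text:
        return 0.0  # avoid dividing by zero on empty input
    tokens = tokenizer.encode(text)
    return len(tokens) / len(text)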

👉 Run this across:

Different languages
Code snippets
Math expressions
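One caveat when comparing languages: a character takes a different number of UTF-8 bytes depending on the script, so a per-character ratio and a per-byte ratio can rank the same inputs differently. The sketch below computes both. The `tokens_per_byte` normalization and the `WhitespaceTokenizer` stand-in are assumptions added for illustration, not something the dataset prescribes.

```python
class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()


def tokens_per_char(tokenizer, text):
    """Tokens divided by Unicode code points (0.0 for empty input)."""
    return len(tokenizer.encode(text)) / len(text) if text else 0.0


def tokens_per_byte(tokenizer, text):
    """Tokens divided by UTF-8 encoded size in bytes (0.0 for empty input)."""
    n_bytes = len(text.encode("utf-8"))
    return len(tokenizer.encode(text)) / n_bytes if n_bytes else 0.0


tok = WhitespaceTokenizer()
samples = {"english": "hello world", "chinese": "世界 你好"}

# name -> (tokens per character, tokens per byte)
report = {
    name: (tokens_per_char(tok, text), tokens_per_byte(tok, text))
    for name, text in samples.items()
}
```

For ASCII text the two numbers coincide. For CJK text the per-byte ratio is smaller, because each character occupies three bytes in UTF-8.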

🌐 5. Test Unicode Robustness

def unicode_test(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    return text == decoded

Test on:

Multilingual text
Emojis
Scientific symbols
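A strict equality check can fail for tokenizers that apply Unicode normalization before encoding, even when no information is really lost. The following sketch separates three outcomes instead of returning a single boolean. The status labels and the three toy tokenizer classes are hypothetical and exist only to exercise the function.

```python
import unicodedata


def round_trip_status(tokenizer, text):
    """Classify an encode/decode round trip as 'exact', 'normalized', or 'lossy'."""
    decoded = tokenizer.decode(tokenizer.encode(text))
    if decoded == text:
        return "exact"
    # Equal only after NFKC normalization: the tokenizer normalized the input
    if unicodedata.normalize("NFKC", decoded) == unicodedata.normalize("NFKC", text):
        return "normalized"
    return "lossy"


class CodepointTokenizer:
    """Toy tokenizer: one token per Unicode code point, fully reversible."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


class NfkcTokenizer(CodepointTokenizer):
    """Toy tokenizer that applies NFKC normalization before encoding."""

    def encode(self, text):
        return super().encode(unicodedata.normalize("NFKC", text))


class AsciiOnlyTokenizer(CodepointTokenizer):
    """Toy tokenizer that replaces every non-ASCII character with '?'."""

    def encode(self, text):
        return [ord(c) if ord(c) < 128 else ord("?") for c in text]


# "\ufb01" is the single-character "fi" ligature, which NFKC expands to "fi"
sample = "\ufb01ne 世界 🚀"
statuses = {
    cls.__name__: round_trip_status(cls(), sample)
    for cls in (CodepointTokenizer, NfkcTokenizer, AsciiOnlyTokenizer)
}
```

Whether a "normalized" result counts as a pass depends on your use case, so it is worth logging the status rather than only a true/false flag.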

🧪 6. Long Sequence Testing

long_text = "AI_TOKEN_TEST " * 1000  # 14 chars x 1000 = 14K chars
tokens = tokenizer.encode(long_text)

print("Token count:", len(tokens))

👉 Helps evaluate:

Context handling
Token explosion
Memory efficiency
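To spot token explosion, it can help to check whether the tokens-per-character ratio stays roughly constant as the input grows. Here is a minimal sketch under that assumption. The `growth_table` helper and the `WhitespaceTokenizer` stand-in are hypothetical names introduced for illustration.

```python
class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()


def growth_table(tokenizer, unit, repeats=(1, 10, 100, 1000)):
    """Return (chars, tokens, tokens_per_char) for the unit repeated n times."""
    rows = []
    for n in repeats:
        text = unit * n
        n_tokens = len(tokenizer.encode(text))
        rows.append((len(text), n_tokens, n_tokens / len(text)))
    return rows


# "AI_TOKEN_TEST " is 14 characters, so 1000 repeats give 14,000 characters
rows = growth_table(WhitespaceTokenizer(), "AI_TOKEN_TEST ")
for chars, tokens, ratio in rows:
    print(f"{chars:>6} chars  {tokens:>5} tokens  {ratio:.4f} tokens/char")
```

A ratio that climbs with length would suggest the tokenizer handles repetition or long inputs poorly. A flat ratio, as with the toy tokenizer here, is the expected healthy case.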

⚠️ 7. Recommended Evaluation Strategy

Run comparisons across:

Multiple tokenizers (e.g. BPE and Unigram models, via libraries such as SentencePiece)
Multiple categories:
- Human languages
- Code
- Math & symbols

Track:

Token count
Compression ratio
Decode fidelity
Stability on long inputs

🧠 Pro Tip

For serious benchmarking, log results like:

{
  "tokenizer": "tiktoken",
  "language": "hindi",
  "avg_tokens": 18.2,
  "compression_ratio": 0.32,
  "unicode_safe": True
}

👉 This allows you to build:

Leaderboards
Tokenizer comparisons
Performance dashboards
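One simple way to persist such records is JSON Lines, with one record per line, which is easy to append to and to load later for a leaderboard. The sketch below uses only the standard library. The helper names are hypothetical, the first record reuses the example values above, and the second record is made-up placeholder data.

```python
import io
import json


def write_jsonl(records, fh):
    """Write one JSON object per line; keep non-ASCII text readable."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


# Placeholder numbers for illustration only, not measured results
records = [
    {"tokenizer": "tiktoken", "language": "hindi", "avg_tokens": 18.2,
     "compression_ratio": 0.32, "unicode_safe": True},
    {"tokenizer": "toy_whitespace", "language": "hindi", "avg_tokens": 9.0,
     "compression_ratio": 0.15, "unicode_safe": False},
]

buffer = io.StringIO()  # stands in for open("results.jsonl", "w", encoding="utf-8")
write_jsonl(records, buffer)

buffer.seek(0)
loaded = read_jsonl(buffer)

# A minimal "leaderboard": lowest compression ratio first
leaderboard = sorted(loaded, key=lambda r: r["compression_ratio"])
```

Note that Python's `True` is written as `true` in the JSON output, so the file stays readable by non-Python tools.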

📏 How to Measure Tokenizer Performance

1. Token Count

Measure how many tokens each input produces.

tokens = tokenizer.encode(text)
print(len(tokens))

👉 Lower token count (for same meaning) = better efficiency

2. Compression Ratio

compression_ratio = len(tokens) / len(text)

Lower ratio (fewer tokens per character) → more efficient tokenizer
Indicates how efficiently text is represented

3. Unicode Handling

Test:

Multilingual text
Emojis
Mathematical symbols

test = "Hello 世界 🚀 α β γ ∑"
tokens = tokenizer.encode(test)
decoded = tokenizer.decode(tokens)

Check:

Is decoded text identical?
Any corruption?
Any token explosion?

4. Edge Case Robustness

Test:

Long sequences (2K–10K chars)
Mixed scripts
Noisy text
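If you want to generate extra edge cases beyond the ones shipped in the dataset, here is a rough sketch. All three helpers (`add_noise`, `mixed_script`, `pad_to_length`) and the choice of noise characters are assumptions for illustration, not part of TokenizerBench.

```python
import random

# Zero-width space, no-break space, tab, and two punctuation marks
NOISE_CHARS = "\u200b\u00a0\t~#"


def add_noise(text, rate=0.1, seed=0):
    """Insert a random noise character after roughly `rate` of the characters."""
    rng = random.Random(seed)  # seeded so the same input gives the same output
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(NOISE_CHARS))
    return "".join(out)


def mixed_script(parts, sep=" "):
    """Join fragments written in different scripts into one input string."""
    return sep.join(parts)


def pad_to_length(text, target_chars):
    """Repeat non-empty text until it is exactly target_chars characters long."""
    repeats = -(-target_chars // len(text))  # ceiling division
    return (text * repeats)[:target_chars]


mixed = mixed_script(["Hello", "世界", "مرحبا", "🚀"])
noisy = add_noise("Hello world", rate=0.3, seed=42)
long_text = pad_to_length(mixed + " ", 2000)
```

Each generated string can then be passed through the same token count and round-trip checks as the regular samples.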

TODO

Expand human_languages → 100 languages using ISO language list
Keep same semantic structure across languages for consistency
Add longer sequences (2K–10K chars) to test tokenizer limits

Project details

This version

0.2.0

Apr 3, 2026

Download files

Source Distribution

tokenizerbench-0.2.0.tar.gz (91.7 kB)

Uploaded Apr 3, 2026 Source

Built Distribution

tokenizerbench-0.2.0-py3-none-any.whl (89.3 kB)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file tokenizerbench-0.2.0.tar.gz.

File metadata

Download URL: tokenizerbench-0.2.0.tar.gz
Upload date: Apr 3, 2026
Size: 91.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tokenizerbench-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`b6045140ae1b2e69564f03646da8d1388ca55dee08fafba46b23c4cef3fca6e0`
MD5	`4b54c5bfa36c378aa89b044b96c4ee14`
BLAKE2b-256	`5db4bac6e724abef2565d606c3707ae636bdb25c2975c027b1c43f0ec1de029e`

Provenance

The following attestation bundles were made for tokenizerbench-0.2.0.tar.gz:

Publisher: python-publish.yml on kitefishai/TokenizerBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenizerbench-0.2.0.tar.gz
- Subject digest: b6045140ae1b2e69564f03646da8d1388ca55dee08fafba46b23c4cef3fca6e0
- Sigstore transparency entry: 1226851320
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: kitefishai/TokenizerBench@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Branch / Tag: refs/tags/v0.1
- Owner: https://github.com/kitefishai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Trigger Event: release

File details

Details for the file tokenizerbench-0.2.0-py3-none-any.whl.

File metadata

Download URL: tokenizerbench-0.2.0-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 89.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tokenizerbench-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c8cd848101d12fa3eb6823e188679dadd2283bc0456d919fd5a14ddb6c06a38b`
MD5	`dc8eb9d7647ca8fc0f37c90ea6606635`
BLAKE2b-256	`b11093dadea2068d92328aeb30d16f90b814b42c9def565a8a5ffce2761ef9c3`

Provenance

The following attestation bundles were made for tokenizerbench-0.2.0-py3-none-any.whl:

Publisher: python-publish.yml on kitefishai/TokenizerBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenizerbench-0.2.0-py3-none-any.whl
- Subject digest: c8cd848101d12fa3eb6823e188679dadd2283bc0456d919fd5a14ddb6c06a38b
- Sigstore transparency entry: 1226851327
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: kitefishai/TokenizerBench@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Branch / Tag: refs/tags/v0.1
- Owner: https://github.com/kitefishai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c1d9c30fb671add9f79bcf810ef25c7c417c7f90
- Trigger Event: release
