Benchmark tokenizers across 84 languages, 17 code languages, math formulas, and edge cases

TokenizerBench

This dataset is designed to evaluate a tokenizer's performance across the following input types before you use it for model pre-training or fine-tuning:

  • 🌍 Human languages (multilingual + scripts)
  • 💻 Programming languages (syntax-heavy)
  • 🧮 Math & science expressions (symbols, unicode, formulas)

🎯 Goal

This dataset helps evaluate:

  • Multilingual tokenization quality
  • Code token handling
  • Mathematical symbol parsing
  • Robustness to noisy and mixed inputs

🧩 How to Use the Dataset

The dataset is organized into modular Python files:

data/
├── human_languages.py
├── programming_languages.py
├── scientific_formulas.py
├── edge_cases.py

Each file contains structured dictionaries that can be directly imported and used for tokenizer evaluation.


📥 1. Import the Dataset

from tokenizerbench.data.human_languages import human_languages
from tokenizerbench.data.programming_languages import programming_languages
from tokenizerbench.data.scientific_formulas import scientific_formulas

🔄 2. Combine All Data (Optional)

dataset = {
    "human_languages": human_languages,
    "programming_languages": programming_languages,
    "scientific_formulas": scientific_formulas
}
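The evaluation code in this README treats each imported object as a dictionary that maps a subcategory name to a list of sample strings. Assuming that layout holds, the combined dataset can be flattened into simple records for filtering or logging. The toy data and the `iter_samples` helper below are hypothetical and are not part of the package:

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in data; the real dictionaries come from tokenizerbench.data.
toy_dataset: Dict[str, Dict[str, List[str]]] = {
    "human_languages": {"english": ["Hello world"], "hindi": ["नमस्ते दुनिया"]},
    "programming_languages": {"python": ["print('hi')", "x = [i for i in range(3)]"]},
}


def iter_samples(
    dataset: Dict[str, Dict[str, List[str]]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (category, subcategory, text) for every sample in the dataset."""
    for category, subcategories in dataset.items():
        for subcategory, samples in subcategories.items():
            for text in samples:
                yield category, subcategory, text


records = list(iter_samples(toy_dataset))
print(len(records), "samples")
```

If the real dictionaries nest differently, only `iter_samples` needs to change.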

🔍 3. Run Tokenizer Evaluation

Example using any tokenizer (Hugging Face, tiktoken, SentencePiece, etc.) whose encode(text) method returns a sequence of tokens or token IDs:

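def evaluate_tokenizer(tokenizer, dataset):
    """Collect per-subcategory token-count statistics.

    Expects dataset[category][subcategory] to be an iterable of strings.
    """
    results = {}

    for category, data in dataset.items():
        results[category] = {}

        for subcategory, samples in data.items():
            # Token count for every sample in this subcategory
            token_counts = [len(tokenizer.encode(text)) for text in samples]

            if not token_counts:
                # Skip empty subcategories to avoid dividing by zero
                continue

            results[category][subcategory] = {
                "avg_tokens": sum(token_counts) / len(token_counts),
                "max_tokens": max(token_counts),
                "min_tokens": min(token_counts),
            }

    return results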
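To try this kind of evaluation without installing a tokenizer library, a byte-level stand-in is enough. The sketch below assumes only that a tokenizer exposes `encode(text)` and `decode(tokens)`. `ByteTokenizer`, `token_stats`, and the samples are hypothetical names introduced for illustration, not part of tokenizerbench:

```python
from statistics import mean


class ByteTokenizer:
    """Hypothetical baseline: one token per UTF-8 byte."""

    def encode(self, text: str) -> list:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list) -> str:
        return bytes(tokens).decode("utf-8")


def token_stats(tokenizer, samples):
    """Average, max, and min token counts for a non-empty list of strings."""
    counts = [len(tokenizer.encode(text)) for text in samples]
    return {
        "avg_tokens": mean(counts),
        "max_tokens": max(counts),
        "min_tokens": min(counts),
    }


samples = ["abc", "世界", "αβγ"]  # illustrative samples only
stats = token_stats(ByteTokenizer(), samples)
print(stats)
```

A byte-level baseline is a useful reference point: a trained subword tokenizer would normally be expected to produce fewer tokens than this on text similar to its training data.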

📊 4. Evaluate Compression Efficiency

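def compression_ratio(tokenizer, text):
    """Return tokens per character; lower means fewer tokens for the same text."""
    if not text:
        return 0.0  # avoid dividing by zero on empty input
    tokens = tokenizer.encode(text)
    return len(tokens) / len(text)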

👉 Run this across:

  • Different languages
  • Code snippets
  • Math expressions
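Characters in different scripts occupy different numbers of UTF-8 bytes, so it can help to report tokens per byte alongside tokens per character when comparing across languages. This is a minimal sketch of both normalizations; `WhitespaceTokenizer` is a hypothetical stand-in, and the function names are illustrative:

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated word."""

    def encode(self, text: str) -> list:
        return text.split()


def tokens_per_char(tokenizer, text: str) -> float:
    """Token count divided by the number of Unicode code points."""
    return len(tokenizer.encode(text)) / len(text) if text else 0.0


def tokens_per_byte(tokenizer, text: str) -> float:
    """Token count divided by the UTF-8 encoded length in bytes."""
    n_bytes = len(text.encode("utf-8"))
    return len(tokenizer.encode(text)) / n_bytes if n_bytes else 0.0


tok = WhitespaceTokenizer()
for sample in ["hello world", "नमस्ते दुनिया"]:
    print(sample, tokens_per_char(tok, sample), tokens_per_byte(tok, sample))
```

For pure ASCII text the two values match; for text with multi-byte characters the per-byte value is smaller.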

🌐 5. Test Unicode Robustness

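def unicode_test(tokenizer, text):
    # Round trip: encode, decode, and compare with the original text.
    # Tokenizers that normalize text or add special tokens may fail this check.
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    return text == decoded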

Test on:

  • Multilingual text
  • Emojis
  • Scientific symbols
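A strict equality check cannot tell a tokenizer that loses information from one that only normalizes its input (for example with NFKC). The sketch below separates the two cases using the standard library's `unicodedata.normalize`. Both tokenizer classes and `roundtrip_report` are hypothetical stand-ins written for this example:

```python
import unicodedata


class ByteTokenizer:
    """Hypothetical baseline: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


class NfkcTokenizer(ByteTokenizer):
    """Hypothetical tokenizer that applies NFKC normalization before encoding."""

    def encode(self, text):
        return super().encode(unicodedata.normalize("NFKC", text))


def roundtrip_report(tokenizer, text):
    """Classify a round trip as exact, normalization-only, or lossy."""
    decoded = tokenizer.decode(tokenizer.encode(text))
    if decoded == text:
        return "exact"
    for form in ("NFC", "NFKC"):
        if unicodedata.normalize(form, decoded) == unicodedata.normalize(form, text):
            return f"equal after {form}"
    return "lossy"


ligature = "\ufb01ne"  # starts with the single code point U+FB01 (fi ligature)
print(roundtrip_report(ByteTokenizer(), ligature))
print(roundtrip_report(NfkcTokenizer(), ligature))
```

Whether a normalization-only mismatch counts as a failure depends on the downstream use; logging the category rather than a bare boolean keeps that decision open.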

🧪 6. Long Sequence Testing

long_text = "AI_TOKEN_TEST " * 1000  # 14 chars x 1000 = 14,000 chars
tokens = tokenizer.encode(long_text)

print("Token count:", len(tokens))

👉 Helps evaluate:

  • Context handling
  • Token explosion
  • Memory efficiency
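One way to look for token explosion is to grow the input step by step and watch the tokens-per-character ratio; a roughly constant ratio suggests the token count scales linearly with length. The sketch below measures counts only, not memory. `ByteTokenizer` and `token_growth` are hypothetical names used for illustration:

```python
class ByteTokenizer:
    """Hypothetical baseline: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))


def token_growth(tokenizer, unit, repeats=(1, 10, 100, 1000)):
    """Return (repeat count, chars, tokens, tokens per char) for each input size."""
    rows = []
    for n in repeats:
        text = unit * n
        n_tokens = len(tokenizer.encode(text))
        rows.append((n, len(text), n_tokens, n_tokens / len(text)))
    return rows


rows = token_growth(ByteTokenizer(), "AI_TOKEN_TEST ")
for n, n_chars, n_tokens, ratio in rows:
    print(f"repeats={n:>5} chars={n_chars:>6} tokens={n_tokens:>6} ratio={ratio:.3f}")
```

Swapping in a real tokenizer object with an `encode` method should be enough to reuse `token_growth` unchanged.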

⚠️ 7. Recommended Evaluation Strategy

Run comparisons across:

  • Multiple tokenizers (e.g. BPE and Unigram models, including ones built with SentencePiece)

  • Multiple categories:

    • Human languages
    • Code
    • Math & symbols

Track:

  • Token count
  • Compression ratio
  • Decode fidelity
  • Stability on long inputs
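The comparison described above can be sketched as a small table of tokens per character, keyed by tokenizer and category. The two tokenizers here are deliberately trivial stand-ins (byte-level and character-level), and the sample strings are made up for the example; a real run would pass actual tokenizer objects and the benchmark data:

```python
class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))


class CharTokenizer:
    """Hypothetical stand-in: one token per Unicode code point."""

    def encode(self, text):
        return list(text)


def compare(tokenizers, samples_by_category):
    """Return table[tokenizer_name][category] = total tokens / total chars."""
    table = {}
    for name, tok in tokenizers.items():
        table[name] = {}
        for category, samples in samples_by_category.items():
            total_tokens = sum(len(tok.encode(s)) for s in samples)
            total_chars = sum(len(s) for s in samples)
            table[name][category] = total_tokens / total_chars
    return table


samples_by_category = {  # illustrative samples only
    "human": ["Hello", "世界"],
    "code": ["for i in range(3): pass"],
    "math": ["α + β = γ"],
}
table = compare({"byte": ByteTokenizer(), "char": CharTokenizer()}, samples_by_category)
for name, row in table.items():
    print(name, {cat: round(ratio, 2) for cat, ratio in row.items()})
```

Aggregating totals per category, rather than averaging per-sample ratios, keeps very short samples from dominating the result.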

🧠 Pro Tip

For serious benchmarking, log results like:

{
  "tokenizer": "tiktoken",
  "language": "hindi",
  "avg_tokens": 18.2,
  "compression_ratio": 0.32,
  "unicode_safe": True
}

👉 This allows you to build:

  • Leaderboards
  • Tokenizer comparisons
  • Performance dashboards
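Records shaped like the one above can be written one per line as JSON Lines, which is easy to append to and to load later. This is a sketch using only the standard library; `write_results` and `read_results` are hypothetical helper names, and the record values are illustrative, not measured results:

```python
import io
import json


def write_results(records, fp):
    """Write one JSON object per line; keep non-ASCII text readable."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_results(fp):
    """Read records back from a JSON Lines stream, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


records = [  # illustrative values only
    {"tokenizer": "tiktoken", "language": "hindi", "avg_tokens": 18.2,
     "compression_ratio": 0.32, "unicode_safe": True},
]

buffer = io.StringIO()  # stands in for a file opened with encoding="utf-8"
write_results(records, buffer)
buffer.seek(0)
loaded = read_results(buffer)
print(loaded)
```

For a real run, replacing the `StringIO` buffer with `open(path, "a", encoding="utf-8")` would append each benchmark run to the same log.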

📏 How to Measure Tokenizer Performance

1. Token Count

Measure how many tokens each input produces.

tokens = tokenizer.encode(text)
print(len(tokens))

👉 Lower token count (for the same meaning) = better efficiency


2. Compression Ratio

compression_ratio = len(tokens) / len(text)

  • Lower ratio → fewer tokens per character, i.e. a more efficient tokenizer
  • Indicates how efficiently text is represented

3. Unicode Handling

Test:

  • Multilingual text
  • Emojis
  • Mathematical symbols

test = "Hello 世界 🚀 α β γ ∑"
tokens = tokenizer.encode(test)
decoded = tokenizer.decode(tokens)

Check:

  • Is decoded text identical?
  • Any corruption?
  • Any token explosion?
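To see which characters drive the token count up, each non-space character can be encoded on its own and the counts compared. This is a rough sketch: it assumes `encode` adds no special tokens, and subword tokenizers may encode a character differently in context than in isolation. `ByteTokenizer` and `tokens_by_char` are hypothetical names:

```python
class ByteTokenizer:
    """Hypothetical baseline: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))


def tokens_by_char(tokenizer, text):
    """Return (character, token count) for each non-whitespace character."""
    return [(ch, len(tokenizer.encode(ch))) for ch in text if not ch.isspace()]


test = "Hello 世界 🚀 α β γ ∑"
per_char = tokens_by_char(ByteTokenizer(), test)

# Show the most expensive characters first
for ch, n_tokens in sorted(per_char, key=lambda item: -item[1])[:5]:
    print(repr(ch), n_tokens)
```

Characters that cost many tokens each are good candidates for a closer look at vocabulary coverage.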

4. Edge Case Robustness

Test:

  • Long sequences (2K–10K chars)
  • Mixed scripts
  • Noisy text
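Inputs of these three kinds can also be generated on the fly. The helpers below are hypothetical and are not taken from the package's edge_cases module; they are a minimal sketch with a fixed random seed so that runs are repeatable:

```python
import random


def repeat_to_length(unit: str, target_chars: int) -> str:
    """Repeat a unit string and trim it to exactly target_chars characters."""
    reps = -(-target_chars // len(unit))  # ceiling division
    return (unit * reps)[:target_chars]


def mixed_script(parts, sep=" "):
    """Join fragments written in different scripts into one string."""
    return sep.join(parts)


def add_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Replace roughly `rate` of the characters with symbols from a small noise set."""
    rng = random.Random(seed)
    noise = "#@!~*"
    return "".join(rng.choice(noise) if rng.random() < rate else ch for ch in text)


long_input = repeat_to_length("tokenizer benchmark ", 2000)
mixed_input = mixed_script(["Hello", "世界", "привет", "α+β"])
noisy_input = add_noise(long_input, rate=0.1, seed=42)
print(len(long_input), mixed_input, noisy_input[:40])
```

Because the noise is seeded, the same noisy input can be fed to every tokenizer under comparison.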

TODO

  • Expand human_languages → 100 languages using ISO language list
  • Keep same semantic structure across languages for consistency
  • Add longer sequences (2K–10K chars) to test tokenizer limits

