Production-grade tokenizer achieving >16M tokens/s via AVX2/SIMD optimizations and Double-Array Trie engine.
XERV Crayon
The Cartridge-Based Tokenizer for Specialized AI
Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.
Key Features
| Feature | Description |
|---|---|
| Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual) |
| AVX2 Double-Array Trie | Validated ~10M tokens/sec via SIMD-accelerated branchless tokenization |
| Zero-Copy Memory Mapping | DAT files loaded via mmap for instant startup & minimal RAM |
| Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads |
| Offline Resilience | Seamless local bootstrap fallback. Works offline out-of-the-box |
| Entropy-Guided Construction | Information-theoretic token selection for maximum domain efficiency |
Benchmarks: The Numbers Speak
100% HONEST. NO SUGARCOATING. DATA-DRIVEN.
Run `python benchmark_competitive.py` to reproduce these results yourself.
Speed Comparison
| Tokenizer | Tokens/sec | vs CRAYON |
|---|---|---|
| CRAYON (lite, 50k) | 6,010,525 | baseline |
| tiktoken (GPT-4) | 524,469 | 11.5x slower |
| tiktoken (GPT-3) | 466,823 | 12.9x slower |
| HF LLaMA (SP-BPE) | 281,558 | 21.3x slower |
| HF GPT-2 (BPE) | 237,117 | 25.3x slower |
| HF BERT (WordPiece) | 202,269 | 29.7x slower |
| HF T5 (SentencePiece) | 189,928 | 31.6x slower |
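The "vs CRAYON" column can be re-derived from the Tokens/sec column alone. The short check below copies the throughput figures from the table and divides the CRAYON baseline by each competitor; the dictionary and variable names are only for illustration.

```python
# Tokens/sec figures copied from the speed comparison table.
throughput = {
    "CRAYON (lite, 50k)": 6_010_525,
    "tiktoken (GPT-4)": 524_469,
    "tiktoken (GPT-3)": 466_823,
    "HF LLaMA (SP-BPE)": 281_558,
    "HF GPT-2 (BPE)": 237_117,
    "HF BERT (WordPiece)": 202_269,
    "HF T5 (SentencePiece)": 189_928,
}

baseline = throughput["CRAYON (lite, 50k)"]

# Slowdown factor = baseline throughput / competitor throughput,
# rounded to one decimal place as in the table.
slowdown = {
    name: round(baseline / tps, 1)
    for name, tps in throughput.items()
    if name != "CRAYON (lite, 50k)"
}

for name, factor in slowdown.items():
    print(f"{name}: {factor}x slower")
```

The computed factors match the published column (11.5x through 31.6x).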
Full Benchmark Results
| Tokenizer | Vocab Size | Tokens/sec | MB/sec | Load Time | Avg Time |
|---|---|---|---|---|---|
| CRAYON (lite, 50k) | 50,000 | 6,010,525 | 15.33 | 0.54ms | 4.56ms |
| tiktoken (cl100k/GPT-4) | 100,000 | 524,469 | 2.18 | 0.01ms | 32.03ms |
| tiktoken (p50k/GPT-3) | 50,000 | 466,823 | 1.55 | 0.00ms | 44.98ms |
| HF LLaMA (SP-BPE) | 32,000 | 281,558 | 0.95 | 1212.02ms | 73.52ms |
| HF GPT-2 (BPE) | 50,257 | 237,117 | 0.69 | 2051.18ms | 100.79ms |
| HF BERT (WordPiece) | 30,522 | 202,269 | 0.73 | 1603.10ms | 95.43ms |
| HF T5 (SentencePiece) | 32,000 | 189,928 | 0.68 | 1727.91ms | 102.15ms |
Test Environment & Methodology
- Platform: Windows AMD64, Python 3.13.1
- Test Text: 68.4 KB mixed content (code, prose, multilingual)
- Iterations: 10 runs + 2 warmup per tokenizer
- Full methodology: BENCHMARK_RESULTS.md
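For readers who want to see how figures such as Tokens/sec, MB/sec, and Avg Time relate to each other, here is a minimal timing harness following the stated protocol (2 warmup runs, 10 timed runs). It is a sketch, not the contents of `benchmark_competitive.py`: the `benchmark` function, its return keys, and the choice of 1 MB = 1024 × 1024 bytes are assumptions, and a whitespace splitter stands in for a real tokenizer so the example runs without Crayon installed.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(tokenize: Callable[[str], List], text: str,
              runs: int = 10, warmup: int = 2) -> Dict[str, float]:
    """Time `tokenize` over `text` and derive throughput figures."""
    for _ in range(warmup):
        tokenize(text)  # warmup runs are executed but not timed

    timings = []
    n_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = tokenize(text)
        timings.append(time.perf_counter() - start)
        n_tokens = len(tokens)

    avg = mean(timings)
    n_bytes = len(text.encode("utf-8"))
    return {
        "avg_time_ms": avg * 1e3,
        "tokens_per_sec": n_tokens / avg,
        "mb_per_sec": n_bytes / (1024 * 1024) / avg,  # assumes 1 MB = 2**20 bytes
    }


# Stand-in tokenizer (whitespace split); swap in any callable str -> list.
sample = 'fn main() { println!("Hello, World!"); } ' * 1000
result = benchmark(str.split, sample)
print(result)
```

To measure a real tokenizer, pass its tokenize callable in place of `str.split`.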
Key Takeaways
| Metric | Result |
|---|---|
| vs tiktoken (GPT-4) | 11.5x faster |
| vs HuggingFace GPT-2 | 25x faster |
| Load time | 0.54ms (vs 1-2s for HuggingFace) |
| Peak throughput | 10.4M tokens/sec (science profile) |
Quick Start
Get tokenizing in under 60 seconds:
Option 1: Direct DAT Compilation
import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_fast
# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)
# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")
# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_fast.load_dat(mm)
# Ultra-fast tokenization
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_fast.tokenize(code)
print(f"Tokens: {tokens}")
Option 2: Profile System (Recommended)
from crayon.core.vocabulary import CrayonVocab
# Load pre-compiled profile (requires one-time compile_profiles.py)
vocab = CrayonVocab.load_profile("code")
tokens = vocab.tokenize("fn main() { }")
decoded = vocab.decode(tokens)
print(f"Decoded: {decoded}")
Installation
git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .
Build the AVX2 Extension
python setup.py build_ext --inplace
Note: Requires a C++ compiler (MSVC on Windows, GCC/Clang on Linux/Mac).
One-Time Setup: Compile Profiles
# Builds .dat files into ~/.cache/xerv/crayon/profiles/
python compile_profiles.py
Each profile takes between 38 ms and 26 s to compile, depending on its size. See DAT_BUILDING_EXPLAINED.md for details.
Verify Installation
python demo_tokenize.py
Expected output:
[1] Loading 'lite' profile...
Status: Fast C++ DAT Engine
[2] Tokenizing: 'Hello, world! This is Crayon.'
Tokens IDs: [...]
DAT Engine V2 Architecture
Crayon V2 uses a "God Tier" implementation combining:
+-------------+     +--------------+     +-------------+     +--------------+
| vocab.json  | --> | DATBuilder   | --> | vocab.dat   | --> | C++ Engine   |
| (List)      |     | (Python)     |     | (Binary)    |     | (AVX2)       |
+-------------+     +--------------+     +-------------+     +--------------+
| Component | File | Purpose |
|---|---|---|
| Offline Compiler | `dat_builder.py` | First-Fit algorithm → compact DAT binary |
| AVX2 Runtime | `engine.cpp` | Branchless state transitions + SIMD parallel ASCII |
| Zero-Copy Loader | mmap + buffer protocol | Instant startup, minimal RAM |
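To make the double-array idea concrete, the following is a small pure-Python teaching sketch of a double-array trie built with a first-fit placement strategy and queried with greedy longest-match. It is not the code in `dat_builder.py` or `engine.cpp`, and it does not reproduce the `.dat` binary layout: the class name `ToyDoubleArrayTrie`, the `FREE`/`ROOT` markers, and the use of `-1` as the unknown id are all assumptions made for illustration.

```python
from collections import deque
from typing import List

FREE = -1   # slot not owned by any state
ROOT = -2   # marker for the root slot, which has no parent


class ToyDoubleArrayTrie:
    """Teaching sketch of a double-array trie over UTF-8 bytes."""

    def __init__(self, vocab: List[str]) -> None:
        self.base = [0]       # base[s] + byte -> index of the child state
        self.check = [ROOT]   # check[t] == s confirms that t is a child of s
        self.value = [-1]     # token id if the state ends a token, else -1
        self._build(vocab)

    def _grow(self, size: int) -> None:
        extra = size - len(self.base)
        if extra > 0:
            self.base.extend([0] * extra)
            self.check.extend([FREE] * extra)
            self.value.extend([-1] * extra)

    def _first_fit(self, codes: List[int]) -> int:
        """Return the smallest base >= 1 whose child slots are all free."""
        base = 1
        while True:
            self._grow(base + 256)
            if all(self.check[base + c] == FREE for c in codes):
                return base
            base += 1

    def _build(self, vocab: List[str]) -> None:
        # Step 1: ordinary pointer trie; each node is [children, token_id].
        root = [{}, -1]
        for token_id, token in enumerate(vocab):
            node = root
            for byte in token.encode("utf-8"):
                node = node[0].setdefault(byte, [{}, -1])
            node[1] = token_id
        # Step 2: breadth-first placement of every node into the flat arrays.
        queue = deque([(root, 0)])
        while queue:
            (children, token_id), state = queue.popleft()
            self.value[state] = token_id
            if not children:
                continue
            base = self._first_fit(list(children))
            self.base[state] = base
            for byte, child in children.items():
                self.check[base + byte] = state
                queue.append((child, base + byte))

    def tokenize(self, text: str, unk_id: int = -1) -> List[int]:
        """Greedy longest-match; an unmatched byte yields unk_id."""
        data = text.encode("utf-8")
        out, pos = [], 0
        while pos < len(data):
            state, best_id, best_len = 0, unk_id, 1
            for i in range(pos, len(data)):
                nxt = self.base[state] + data[i]
                if nxt >= len(self.check) or self.check[nxt] != state:
                    break
                state = nxt
                if self.value[state] != -1:
                    best_id, best_len = self.value[state], i - pos + 1
            out.append(best_id)
            pos += best_len
        return out


trie = ToyDoubleArrayTrie(["a", "ab", "abc", "b", "bc", "c", " "])
ids = trie.tokenize("abcab c")
print(ids)
```

Each transition is one addition and one comparison against `check`, which is the property that makes the structure amenable to branchless and SIMD-friendly implementations in a compiled engine.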
Available Cartridges
Five production-ready profiles are defined in `src/crayon/core/profiles.py`:
| Profile | Size | Optimized For | Sources |
|---|---|---|---|
| `lite` | 50k | Speed & Mobile | WikiText, RainDrop |
| `science` | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| `code` | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| `multilingual` | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| `arts_commerce` | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |
vocab = CrayonVocab.load_profile("science")
vocab = CrayonVocab.load_profile("multilingual")
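Why a domain cartridge helps can be illustrated without Crayon at all. The sketch below tokenizes the same Rust snippet with two tiny, made-up vocabularies using a plain greedy longest-match loop; `greedy_tokenize` and both toy vocabularies are hypothetical and far smaller than any real profile.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match; unknown characters fall back to single chars."""
    max_len = max(map(len, vocab))
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                pos += length
                break
    return tokens


# Hypothetical toy "cartridges" (real profiles hold 50k-250k entries).
general_vocab = {"the", "print", " ", "in", "ln", "!", "(", ")"}
code_vocab = {"println!", "fn ", "main", "()", " ", "{", "}"}

snippet = "fn main() { println!() }"
general_tokens = greedy_tokenize(snippet, general_vocab)
code_tokens = greedy_tokenize(snippet, code_vocab)

print(len(general_tokens), general_tokens)
print(len(code_tokens), code_tokens)
```

The code-oriented vocabulary covers the snippet with fewer, longer tokens, which is the effect a specialized profile aims for at scale.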
Advanced Usage
Compile Vocabulary to DAT Format
from crayon.c_ext.dat_builder import DATBuilder
import json
with open("trained_vocab_lite.json", "r") as f:
    vocab = json.load(f)
builder = DATBuilder()
builder.build(vocab)
builder.save("vocab_lite.dat")
Direct C++ Engine Access
import mmap
from crayon.c_ext import crayon_fast
with open("vocab_lite.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_fast.load_dat(mm)
tokens = crayon_fast.tokenize("Your text here")
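The zero-copy loading pattern used above can be tried in isolation with the standard library. This sketch writes a few native C `int` values to a temporary file (a stand-in, not the real `.dat` layout), maps it read-only, and reinterprets the mapped bytes through the buffer protocol. The file name and values are invented for the example.

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file of native C ints (a stand-in for a .dat file).
values = [7, 42, -1, 163000]
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(values)}i", *values))

# Map the file read-only; the mapping stays valid after the file is closed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# memoryview exposes the mapped pages through the buffer protocol; cast()
# reinterprets them as C ints without copying the underlying bytes.
raw = memoryview(mm)
view = raw.cast("i")
loaded = list(view)  # copying into a Python list here is only for display
print(loaded)

# Views must be released before the mapping can be closed.
view.release()
raw.release()
mm.close()
```

A native extension that accepts a buffer object can read the mapped pages directly in the same way, so only the pages actually touched are brought into memory.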
Force Rebuild / Offline Mode
# Rebuild from local resources only (fastest)
vocab = CrayonVocab.load_profile("arts_commerce", force_rebuild=True)
Architecture
| Layer | File | Purpose |
|---|---|---|
| Builder | `c_ext/dat_builder.py` | Offline DAT compiler |
| Engine | `c_ext/engine.cpp` | AVX2 SIMD runtime |
| Config | `core/profiles.py` | Cartridge definitions |
| Resources | `resources.py` | Streaming, fallbacks, caching |
For a deep dive, read the Engineering Treatise.
Testing
# All tests
python -m pytest tests/ -v
# DAT engine tests
python -m pytest tests/test_c_ext.py -v
14/14 tests pass: DATBuilder, C++ module, full pipeline, Python fallback.
DAT Engine Verification
python verify_dat_engine.py
============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 9,786,707 tokens/sec
STATUS: HYPER-PRODUCTION READY
Training Data
| Dataset | Size | Samples | Domain |
|---|---|---|---|
| Tiny Shakespeare | 1.06 MB | 1 (Full) | Classical Literature |
| RainDrop-DTS | 179 KB | 3,210 | Instruction Following |
| Physics | 332 KB | 700 | Scientific Reasoning |
| GRAD Math | 5.00 MB | 500* | Graduate Mathematics |
| TOTAL | ~6.56 MB | 4,411 | Curated Corpus |
*GRAD dataset limited to 500 high-density samples for efficient default build.
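The README does not spell out how entropy-guided selection turns a corpus like this into a vocabulary, so the following is only one plausible, simplified reading and not Crayon's actual algorithm. It scores each candidate substring by how many bits of the corpus it would cover under a unigram character model (frequency × summed character self-information) and keeps the top scorers. The function name `select_tokens`, the `max_len` cap, and the scoring rule are all assumptions.

```python
import math
from collections import Counter
from typing import List


def select_tokens(corpus: str, vocab_size: int, max_len: int = 6) -> List[str]:
    """Pick single characters plus the highest-scoring multi-char substrings."""
    char_counts = Counter(corpus)
    total = sum(char_counts.values())
    # Self-information of each character under a unigram model, in bits.
    char_bits = {ch: -math.log2(n / total) for ch, n in char_counts.items()}

    # Count every substring of length 2..max_len (overlaps included).
    candidates = Counter(
        corpus[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(corpus) - n + 1)
    )

    def score(item) -> float:
        token, count = item
        # Bits of corpus text this token would cover if emitted as one unit.
        return count * sum(char_bits[ch] for ch in token)

    ranked = sorted(candidates.items(), key=score, reverse=True)
    # Always keep single characters so any corpus text stays encodable.
    singles = sorted(char_counts)
    budget = max(0, vocab_size - len(singles))
    return singles + [token for token, _ in ranked[:budget]]


toy_vocab = select_tokens("abababab cd", vocab_size=6)
print(toy_vocab)
```

This naive score counts overlapping occurrences and ignores interactions between selected tokens; a production builder would need to account for both.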
API Reference
CrayonVocab
# Constructors
CrayonVocab(tokens: List[str], unk_token: str = "<UNK>")
CrayonVocab.from_corpus(corpus: str, target_size: int = 500000)
CrayonVocab.from_default_sources(vocab_size: int = 500000)
CrayonVocab.from_file(path: str)
CrayonVocab.from_json(path: str)
CrayonVocab.load_profile(name: str) # Load cached DAT profiles
# Methods
vocab.tokenize(text: str) -> List[int]
vocab.decode(token_ids: List[int]) -> str
vocab.save(path: str, format: str = "txt")
DAT Builder
from crayon.c_ext.dat_builder import DATBuilder
builder = DATBuilder()
builder.build(vocab_list: List[str])
builder.save(output_path: str)
C++ Engine
from crayon.c_ext import crayon_fast
crayon_fast.load_dat(buffer) # bytes, mmap, or memoryview
crayon_fast.tokenize(text: str) -> List[int]
Utilities
from crayon import check_c_extension, check_resources
print(check_c_extension()) # True/False
print(check_resources()) # Available data sources
Contributing
We welcome contributions! Whether it's new cartridges, performance optimizations, or bug fixes, open an issue or submit a PR.
Citation
@techreport{xerv2026crayon,
title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
author={Pal, Soham and Xerv Research},
year={2026},
institution={Xerv Research Engineering Division}
}
License
Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.
Built by Xerv Research Engineering Division
Star this repo if Crayon helps your project!