Omni-Backend Tokenizer - CPU (AVX2/512), CUDA (NVIDIA), ROCm (AMD) with automatic hardware detection
XERV Crayon v5.0.1
The Omni-Backend Tokenizer for Specialized AI
Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.
Key Features
| Feature | Description |
|---|---|
| Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual) |
| Omni-Backend | Auto-detects & runs on CPU (AVX2), NVIDIA (CUDA), or AMD (ROCm) |
| Hyper-Fast Trainer | C++17 Linked-List BPE trains vocabularies in seconds (100x faster) |
| Native GPU Kernels | "Bare Metal" C++/CUDA/HIP kernels (no wrappers) for >10M tokens/sec |
| Zero-Copy Mapping | DAT files loaded via mmap for instant startup & minimal RAM |
| Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads |
| Offline Resilience | Seamless local bootstrap fallback. Works offline out-of-the-box |
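For readers unfamiliar with BPE training, the merge loop behind a trainer can be sketched in a few lines of pure Python. This is a naive illustration of classic byte-pair encoding, not Crayon's C++17 linked-list trainer; the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical and chosen only for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break  # every word is already a single symbol
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


print(train_bpe("low lower lowest", 3))
```

This version rescans the whole corpus on every merge, which is the cost that incremental data structures such as linked lists are typically used to avoid.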
Benchmarks: Production Results
DATA-DRIVEN. NO HYPE. 100% VERIFIED.
CPU Performance (Intel i3-7020U AVX2)
Even on modest consumer hardware, Crayon's SIMD-accelerated engine runs roughly 67x to 209x faster than the other tokenizers measured below.
| Tokenizer | Tokens/Sec | Relative to CRAYON (Science) |
|---|---|---|
| CRAYON (Science) | 40,808,299 | 1.0x (Baseline) |
| CRAYON (Code) | 34,742,588 | 1.2x slower |
| Tiktoken (GPT-4) | 608,610 | 67.0x slower |
| HF LLaMA | 343,282 | 118.8x slower |
| HF GPT-2 | 307,563 | 132.6x slower |
| HF BERT | 195,108 | 209.1x slower |
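The benchmark script itself is not shown here, so the following is only a generic sketch of how a tokens/sec figure can be measured for any tokenizer callable. The helper name `tokens_per_second` is hypothetical, and `str.split` stands in for a real tokenizer so the example runs without Crayon installed.

```python
import time


def tokens_per_second(tokenize, docs, repeats=3):
    """Return the best observed throughput of `tokenize` over `docs`."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(len(tokenize(doc)) for doc in docs)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, total / elapsed)
    return best


# Whitespace splitting is a placeholder, not a real subword tokenizer.
docs = ["the quick brown fox jumps over the lazy dog"] * 10_000
rate = tokens_per_second(str.split, docs)
print(f"{rate:,.0f} tokens/sec (whitespace stand-in tokenizer)")
```

Taking the best of several repeats reduces noise from warm-up and background load; absolute numbers still depend heavily on the hardware and the input text.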
GPU Performance (Tesla T4)
Installation Summary (T4 GPU Environment)
======================================================================
XERV CRAYON V4.1.9 INSTALLATION AND BENCHMARKS
======================================================================
[1/7] Checking environment...
PyTorch: 2.9.0+cu126
CUDA: 12.6 (Tesla T4)
* Smart Build: Will compile ONLY for this GPU architecture
NVCC: /usr/local/cuda/bin/nvcc
[2/7] Installing build dependencies...
Done (ninja, packaging, wheel)
[3/7] Cleaning previous installations...
[4/7] Cloning source code...
__version__ = "4.1.9"
[5/7] Compiling and Installing (Streaming Logs)...
----------------------------------------------------------------------
[CRAYON-BUILD] Detected GPU: SM 7.5 -> Compiling for sm_75 ONLY
[CRAYON-BUILD] Configuring CUDA extension (max_jobs=1)
building 'crayon.c_ext.crayon_cpu' extension
[1/1] c++ -O3 -march=native -mavx2 -fPIC -std=c++17
Successfully built crayon_cpu.so
building 'crayon.c_ext.crayon_cuda' extension
[1/1] nvcc -O3 -std=c++17 --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75
Successfully built crayon_cuda.so
Successfully installed xerv-crayon-4.1.9
----------------------------------------------------------------------
[6/7] Verifying installation...
Success! Installed version: 4.1.9
Backends: {'cpu': True, 'cuda': True, 'rocm': False}
Performance Results (T4 GPU vs Tiktoken)
CRAYON (CUDA Backend - Tesla T4):
Active Device: CUDA
Backend: cuda_extension
Batch Throughput (XERV CRAYON):
1,000 docs: 748,048 docs/sec | 9,724,621 tokens/sec
10,000 docs: 639,239 docs/sec | 8,310,109 tokens/sec
50,000 docs: 781,129 docs/sec | 10,154,678 tokens/sec
Tiktoken (cl100k_base - CPU):
Tiktoken Batch Throughput (cl100k_base encoding):
1,000 docs: 87,307 docs/sec | 873,068 tokens/sec
10,000 docs: 81,658 docs/sec | 816,576 tokens/sec
50,000 docs: 107,583 docs/sec | 1,075,829 tokens/sec
Performance Comparison Table
| Batch Size | CRAYON Docs/Sec | CRAYON Tokens/Sec | Tiktoken Docs/Sec | Tiktoken Tokens/Sec | Speedup |
|---|---|---|---|---|---|
| 1,000 | 748,048 | 9,724,621 | 87,307 | 873,068 | 11.1x |
| 10,000 | 639,239 | 8,310,109 | 81,658 | 816,576 | 10.2x |
| 50,000 | 781,129 | 10,154,678 | 107,583 | 1,075,829 | 9.4x |
Average Speedup: 10.2x faster than tiktoken on Tesla T4 GPU
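The Speedup column is the ratio of the two tokens/sec columns. The short script below recomputes it from the figures in the table; the unrounded mean comes out near 10.25x, which matches the reported 10.2x when the per-row values are rounded to one decimal before averaging.

```python
# Tokens/sec figures copied from the comparison table, keyed by batch size.
crayon = {1_000: 9_724_621, 10_000: 8_310_109, 50_000: 10_154_678}
tiktoken = {1_000: 873_068, 10_000: 816_576, 50_000: 1_075_829}

# Speedup per batch size = CRAYON tokens/sec divided by Tiktoken tokens/sec.
speedups = {n: crayon[n] / tiktoken[n] for n in crayon}
average = sum(speedups.values()) / len(speedups)

for n, s in speedups.items():
    print(f"{n:>6,} docs: {s:.1f}x")
print(f"unrounded mean: {average:.2f}x")
```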
Key Achievements
- >10M tokens/sec on mid-tier GPU (Tesla T4)
- Smart compilation - Only builds for detected GPU architecture
- Zero-copy memory mapping - Instant profile loading (<1ms)
- Production-grade stability - Handles 50K+ document batches
- Consistent performance - Minimal variance across batch sizes
Quick Start: The "Omni-Backend"
Run on any hardware with a single line of code. Crayon automatically detects whether AVX2, CUDA, or ROCm is available.
1. Hardware-Aware Initialization
from crayon.core.vocabulary import CrayonVocab
# CPU (Intel/AMD) - AVX2/AVX-512 Native
vocab = CrayonVocab(device="cpu")
# NVIDIA GPUs (All Tensor Core Architectures)
vocab = CrayonVocab(device="cuda")
# AMD GPUs (Instinct/Radeon HIP/ROCm)
vocab = CrayonVocab(device="rocm")
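When the device is not hard-coded, a caller needs some rule for choosing among the available backends. The sketch below shows one possible rule over a dictionary shaped like the `Backends: {...}` line in the installation log. The function `pick_device` and its GPU-first preference order are assumptions made for illustration; how Crayon itself resolves the device is not described here.

```python
def pick_device(backends, preference=("cuda", "rocm", "cpu")):
    """Return the first available backend name in `preference` order."""
    for name in preference:
        if backends.get(name, False):
            return name
    raise RuntimeError("no usable backend found")


# Same shape as the 'Backends' line printed by the installer.
backends = {"cpu": True, "cuda": True, "rocm": False}
device = pick_device(backends)
print(device)
```

The returned string could then be passed as the `device` argument shown above.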
2. The "Context Manager" Hot-Swap
Instantly switch between specialized vocabularies within the same script without reloading the model.
from crayon.core.vocabulary import CrayonVocab
vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")
# ... standard tokenization ...
# TEMPORARY SWITCH to 'code' profile for a function block
with vocab.using_profile("code"):
    # Uses the compact Code vocabulary here
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
# AUTOMATICALLY REVERTS to 'lite' here
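The revert-on-exit behaviour is the standard context-manager pattern. Below is a minimal sketch of how such a method could be written with `contextlib`; `ToyVocab` is a stand-in class invented for this example and does not reflect how `CrayonVocab` is actually implemented.

```python
from contextlib import contextmanager


class ToyVocab:
    """Stand-in for a vocabulary object; not the real CrayonVocab."""

    def __init__(self):
        self.profile = None

    def load_profile(self, name):
        self.profile = name

    @contextmanager
    def using_profile(self, name):
        # Assumes a profile was already loaded before entering the block.
        previous = self.profile
        self.load_profile(name)
        try:
            yield self
        finally:
            # Restore the earlier profile even if the block raised.
            self.load_profile(previous)


vocab = ToyVocab()
vocab.load_profile("lite")
with vocab.using_profile("code"):
    inside = vocab.profile
after = vocab.profile
print(inside, after)
```

Putting the restore step in `finally` means an exception inside the `with` block cannot leave the object stuck on the temporary profile.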
3. Basic Example
import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu # Auto-renamed from crayon_fast
# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)
# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")
# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
crayon_cpu.load_dat(mm)
# Ultra-fast tokenization
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
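To give a feel for what a trie-backed vocabulary does with a token list, here is a small pure-Python sketch. It assumes that "DAT" refers to a double-array trie and that lookup is greedy longest-match; neither is confirmed by this page, and a nested `dict` is used in place of the compact double-array layout. The names `build_trie` and `tokenize`, and the `unk_id` fallback, are hypothetical.

```python
def build_trie(vocab_list):
    """Build a nested-dict trie; a token's id is its index in vocab_list."""
    root = {}
    for token_id, token in enumerate(vocab_list):
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = token_id  # the None key marks the end of a token
    return root


def tokenize(text, root, unk_id=-1):
    """Greedy longest-match: at each position take the longest known token."""
    ids, i = [], 0
    while i < len(text):
        node, match_id, match_end = root, unk_id, i + 1
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match_id, match_end = node[None], j
        ids.append(match_id)  # unk_id if nothing matched at position i
        i = match_end
    return ids


toy_vocab = ["fn", " ", "main", "ma", "(", ")"]
trie = build_trie(toy_vocab)
print(tokenize("fn main()", trie))
```

A double-array trie stores the same transitions in two flat integer arrays, which is what makes the structure suitable for saving to a binary file and memory-mapping.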
Installation
pip install xerv-crayon
Google Colab / Linux Installation
Because Crayon includes high-performance C++ extensions, it compiles natively in your environment:
# Run this in a Colab cell
!pip install xerv-crayon
Build the Extensions
PowerShell (Windows):
python setup.py build_ext --inplace
Bash (Linux/Mac):
python setup.py build_ext --inplace
Note: The setup script auto-detects nvcc and hipcc. If found, GPU backends are built automatically.
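To check ahead of a build which GPU compilers would be visible, a PATH lookup is usually enough. The sketch below uses only the standard library; the function name `detect_gpu_toolchains` is hypothetical, and the assumption that detection is a PATH lookup may not match what the setup script actually does.

```python
import shutil


def detect_gpu_toolchains():
    """Report which GPU compilers are visible on PATH."""
    return {
        "cuda": shutil.which("nvcc") is not None,
        "rocm": shutil.which("hipcc") is not None,
    }


found = detect_gpu_toolchains()
print(found)
```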
Omni-Backend Architecture (v4.0)
Crayon now uses a "God Tier" multi-backend implementation combining:
┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│ vocab.json  │ ──▶ │ DATCompiler  │ ──▶ │ vocab.dat   │ ──▶ │ Omni-Engine  │
│ (List)      │     │ (C++ Fast)   │     │ (Binary)    │     │ CPU/CUDA/HIP │
└─────────────┘     └──────────────┘     └─────────────┘     └──────────────┘
| Component | File | Accelerators |
|---|---|---|
| CPU Backend | c_ext/cpu_engine.cpp | AVX-512 / AVX2 (Intel/AMD) |
| CUDA Backend | c_ext/gpu_engine_cuda.cu | Tensor Cores (NVIDIA Tesla/Ampere) |
| ROCm Backend | c_ext/rocm_engine.cpp | CDNA2 / RDNA3 (AMD Instinct/Radeon) |
| Zero-Copy Loader | mmap + buffer protocol | Instant startup (0.5ms) |
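The "mmap + buffer protocol" combination can be demonstrated with the standard library alone. The sketch below maps a small stand-in binary file (not a real Crayon DAT file) and reads from it through a `memoryview`, so no bytes are copied until they are explicitly requested.

```python
import mmap
import os
import tempfile

# Write a small stand-in binary file (not a real Crayon DAT file).
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as tmp:
    tmp.write(bytes(range(16)))
    path = tmp.name

# Map the file read-only; the mapping stays valid after the file is closed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(mm)       # buffer-protocol view over the mapping, no copy
header = view[:4]           # slicing a memoryview is also copy-free
first_four = bytes(header)  # explicit copy, made here only for display
print(first_four)

# Views must be released before the mapping can be closed.
header.release()
view.release()
mm.close()
os.remove(path)
```

Because the operating system pages the file in on demand, mapping time does not grow with file size the way a full read into memory does.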
Available Cartridges
Six production-ready profiles are defined in src/crayon/core/profiles.py:
| Profile | Size | Optimized For | Sources |
|---|---|---|---|
| standard | 57k | General English (V5 Default) | Lite + Top 10k subwords |
| lite | 50k | Speed & Mobile | WikiText, RainDrop |
| science | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| code | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| multilingual | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| arts_commerce | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |
vocab = CrayonVocab(device="auto")
vocab.load_profile("science")
vocab.load_profile("multilingual")
Verify on Google Colab
Quick Verify Snippet
from crayon import CrayonVocab
# Initialize with Auto-Backend (AVX2/CUDA/ROCm)
tokenizer = CrayonVocab(device="auto")
# 1. Test Standard subword-heavy profile
tokenizer.load_profile("standard")
print(tokenizer.tokenize("that is a test for the standard profile"))
# 2. Test Code specialized profile
tokenizer.load_profile("code")
print(tokenizer.tokenize("def fast_inverse_sqrt(x):"))
Testing & Verification
# Full verification (Benchmarks + Tests)
python verify_dat_engine.py
# Benchmark all backends
python benchmark_competitive.py
============================================================
XERV CRAYON V4.1.9 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 250,000 tokens
DAT Nodes: 370,000+
Throughput: 40,808,299 tokens/sec
STATUS: HYPER-PRODUCTION READY
Citation
@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}
License
Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.
Built with love by Xerv Research Engineering Division
Star this repo if Crayon helps your project!