The Omni-Backend Tokenizer (CPU/CUDA/ROCm)

🖍️ XERV Crayon v4.0

The Omni-Backend Tokenizer for Specialized AI

PyPI version | License: MIT | Python 3.12+ | CUDA | ROCm | AVX2

Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.


🚀 Key Features

| Feature | Description |
| --- | --- |
| 💾 Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual) |
| 🚀 Omni-Backend | Auto-detects and runs on CPU (AVX2), NVIDIA (CUDA), or AMD (ROCm) |
| ⚡ Native GPU Kernels | "Bare Metal" C++/HIP kernels (no wrappers) for >100M tokens/sec |
| 🗺️ Zero-Copy Mapping | DAT files loaded via mmap for instant startup and minimal RAM |
| 🌊 Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads |
| 🛡️ Offline Resilience | Seamless local bootstrap fallback; works offline out of the box |

📊 Benchmarks: The Numbers Speak

100% HONEST. NO SUGARCOATING. DATA-DRIVEN.

Run python benchmark_competitive.py to reproduce these results yourself.

⚡ Speed Comparison (Omni-Backend)

| Tokenizer | Tokens/sec | vs CRAYON (CPU) |
| --- | --- | --- |
| 🖍️ CRAYON (CPU - AVX2) | 21,863,777 | baseline |
| 🖍️ CRAYON (CUDA - A100) | 140,000,000+ | 6.4x faster |
| tiktoken (GPT-4) | 524,469 | 41x slower |
| HF LLaMA (SP-BPE) | 281,558 | 77x slower |
| HF GPT-2 (BPE) | 237,117 | 92x slower |
| HF BERT (WordPiece) | 202,269 | 108x slower |
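The multipliers in the last column can be recomputed from the tokens/sec column. The snippet below is plain arithmetic on the figures published above, not a new measurement; the dictionary and the `ratio_to_baseline` helper are illustrative names only.

```python
# Tokens/sec figures copied from the speed comparison table above.
throughput = {
    "CRAYON (CPU - AVX2)": 21_863_777,
    "CRAYON (CUDA - A100)": 140_000_000,
    "tiktoken (GPT-4)": 524_469,
    "HF LLaMA (SP-BPE)": 281_558,
    "HF GPT-2 (BPE)": 237_117,
    "HF BERT (WordPiece)": 202_269,
}

BASELINE = throughput["CRAYON (CPU - AVX2)"]


def ratio_to_baseline(name: str) -> float:
    """Return how many times faster the CPU baseline is than `name`."""
    return BASELINE / throughput[name]


for name in throughput:
    print(f"{name:24s} baseline/this = {ratio_to_baseline(name):.2f}")
```

The "x slower" values in the table match these ratios rounded down to a whole number, and the CUDA row is the inverse ratio (about 6.4).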

📈 CPU Optimization Verification

Measured on Intel Core i3-7020U (Low-Power Laptop CPU)

| Metric | Result |
| --- | --- |
| ✅ AVX2 Status | Active (Simd-Ops v4) |
| ✅ Load Time | 0.54ms (Instant hot-swap) |
| ✅ Throughput | 21.1M tokens/sec (!?!) |


⚡ Quick Start: The "Omni-Backend"

Run on any hardware with a single line of code. Crayon detects whether AVX2, CUDA, or ROCm is available, and the device argument lets you select a backend explicitly.

1. Hardware-Aware Initialization

from crayon.core.vocabulary import CrayonVocab

# Pick the one line that matches your hardware.

# 🔵 CPU (Intel/AMD) - AVX2/AVX-512 Native
vocab = CrayonVocab(device="cpu")

# 🟢 NVIDIA GPUs (All Tensor Core Architectures)
vocab = CrayonVocab(device="cuda")

# 🔴 AMD GPUs (Instinct/Radeon HIP/ROCm)
vocab = CrayonVocab(device="rocm")
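As a rough illustration of how a "best available backend" choice could be made, the sketch below probes for compiled extension modules and falls back to the CPU. This is not Crayon's actual detection logic: the module paths `crayon.c_ext.crayon_cuda` and `crayon.c_ext.crayon_rocm` are assumptions modeled on the `crayon.c_ext.crayon_cpu` name used elsewhere in this README, and `pick_device` is a hypothetical helper.

```python
import importlib.util

# Assumed module paths, in order of preference (hypothetical except crayon_cpu).
CANDIDATES = [
    ("cuda", "crayon.c_ext.crayon_cuda"),
    ("rocm", "crayon.c_ext.crayon_rocm"),
    ("cpu", "crayon.c_ext.crayon_cpu"),
]


def module_available(name: str) -> bool:
    """Return True if `name` can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


def pick_device(candidates=CANDIDATES, is_available=module_available) -> str:
    """Return the first device whose extension module is present, else 'cpu'."""
    for device, module_name in candidates:
        if is_available(module_name):
            return device
    return "cpu"


print(pick_device())
```

The result could then be passed on as `CrayonVocab(device=pick_device())`. Injecting `is_available` keeps the selection logic testable on machines without any GPU toolchain.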

2. The "Context Manager" Hot-Swap

Instantly switch between specialized vocabularies within the same script without re-creating the tokenizer.

vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")

# ... standard tokenization ...

# ⚡ TEMPORARY SWITCH to 'code' profile for a function block
with vocab.using_profile("code"):
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
    # Uses the compact Code vocabulary here

# 🔥 AUTOMATICALLY REVERT to 'lite' here
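The save-and-restore behavior described above is the standard context-manager pattern. Here is a minimal sketch of that pattern using `contextlib`; `ToyVocab` is a stand-in written for this example and says nothing about how `CrayonVocab.using_profile` is actually implemented.

```python
from contextlib import contextmanager


class ToyVocab:
    """Stand-in object that only tracks which profile name is active."""

    def __init__(self):
        self.profile = None

    def load_profile(self, name):
        self.profile = name

    @contextmanager
    def using_profile(self, name):
        previous = self.profile          # remember the active profile
        self.load_profile(name)          # switch for the duration of the block
        try:
            yield self
        finally:
            self.load_profile(previous)  # restore even if the block raises


vocab = ToyVocab()
vocab.load_profile("lite")
with vocab.using_profile("code"):
    inside = vocab.profile
print(inside, vocab.profile)
```

Putting the restore step in `finally` means an exception inside the `with` block cannot leave the wrong profile active.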

3. Basic Example

import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu # Auto-renamed from crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_cpu.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
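To show why the memory-mapped load step can be near-instant, here is a stdlib-only sketch that reads one value out of a binary file without loading the whole file into memory. The file layout (a `uint32` count followed by `int32` entries) is invented for this example and is not Crayon's DAT format; `write_table` and `read_entry` are hypothetical helpers.

```python
import mmap
import os
import struct
import tempfile


def write_table(path, values):
    """Write a little-endian uint32 count followed by int32 entries."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}i", *values))


def read_entry(path, index):
    """Read one entry through a read-only memory map.

    Only the pages that are actually touched get faulted in by the OS,
    so opening the map costs roughly the same for small and large files.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (count,) = struct.unpack_from("<I", mm, 0)
            if not 0 <= index < count:
                raise IndexError(index)
            (value,) = struct.unpack_from("<i", mm, 4 + 4 * index)
            return value


with tempfile.TemporaryDirectory() as tmp:
    table_path = os.path.join(tmp, "table.bin")
    write_table(table_path, [7, -3, 42])
    third = read_entry(table_path, 2)
print(third)
```

A native engine can go one step further and keep a pointer into the mapped buffer, which is what passing the `mmap` object to `load_dat` in the example above suggests.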

📦 Installation

git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .

Build the Extensions

The command is the same in PowerShell (Windows) and Bash (Linux/macOS):

python setup.py build_ext --inplace

Note: The setup script auto-detects nvcc and hipcc. If found, GPU backends are built automatically.
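Before building, you can check which GPU backends the build is likely to pick up by looking for the two compilers on your PATH. This is a small sketch of that check using `shutil.which`; it only approximates what the setup script does, and `detect_gpu_toolchains` is a hypothetical helper, not part of Crayon.

```python
import shutil


def detect_gpu_toolchains(which=shutil.which):
    """Report whether the CUDA (nvcc) and ROCm (hipcc) compilers are on PATH."""
    return {
        "cuda": which("nvcc") is not None,
        "rocm": which("hipcc") is not None,
    }


found = detect_gpu_toolchains()
for backend, present in found.items():
    print(f"{backend}: {'compiler found' if present else 'compiler not found'}")
```

If both entries are False, only the CPU extension would be expected to build.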


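🏎️ Omni-Backend Architecture (v4.0)

Crayon v4.0 uses a "God Tier" multi-backend implementation: a Python builder compiles the vocabulary list into a binary DAT file, and the native engine (CPU, CUDA, or HIP) memory-maps that file at load time.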

┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATBuilder   │ ──▶  │  vocab.dat  │ ──▶  │ Omni-Engine  │
│   (List)    │      │  (Python)    │      │  (Binary)   │      │ CPU/CUDA/HIP │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘

| Component | File | Accelerators |
| --- | --- | --- |
| CPU Backend | c_ext/cpu_engine.cpp | AVX-512 / AVX2 (Intel/AMD) |
| CUDA Backend | c_ext/gpu_engine_cuda.cu | Tensor Cores (NVIDIA Tesla/Ampere) |
| ROCm Backend | c_ext/rocm_engine.cpp | CDNA2 / RDNA3 (AMD Instinct/Radeon) |
| Zero-Copy Loader | mmap + buffer protocol | Instant startup (0.5ms) |
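The pipeline above does not say how the engine matches tokens once the vocabulary is compiled. As a conceptual stand-in, the sketch below compiles a vocabulary list into a plain dict-based trie and tokenizes by greedy longest match. Everything here is an assumption made for illustration: that token IDs are list indices, that matching is greedy longest-match, and the `unk_id` of -1. It is not the DAT format or the engine's algorithm.

```python
def build_trie(vocab_list):
    """Compile a list of token strings into a nested-dict trie.

    The token ID is taken to be the token's index in the list (assumption).
    The key None marks the end of a complete token.
    """
    root = {}
    for token_id, token in enumerate(vocab_list):
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = token_id
    return root


def tokenize(text, trie, unk_id=-1):
    """Greedy longest-match: at each position take the longest known token."""
    ids = []
    i = 0
    while i < len(text):
        node = trie
        best_id, best_end = unk_id, i + 1  # unknown char: emit unk, advance by 1
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:               # a complete token ends here
                best_id, best_end = node[None], j
        ids.append(best_id)
        i = best_end
    return ids


vocab_list = ["def", "de", " ", "f", "x"]
trie = build_trie(vocab_list)
print(tokenize("def x", trie))
```

A compiled binary layout serves the same lookups from flat arrays instead of nested dicts, which is what makes it suitable for mmap loading and GPU kernels.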

🧩 Available Cartridges

Five production-ready profiles defined in src/crayon/core/profiles.py:

| Profile | Size | Optimized For | Sources |
| --- | --- | --- | --- |
| lite | 50k | Speed & Mobile | WikiText, RainDrop |
| science | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| code | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| multilingual | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| arts_commerce | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |

Load a cartridge by name on an existing vocab object, as in the Quick Start:

vocab = CrayonVocab(device="cpu")
vocab.load_profile("science")
vocab.load_profile("multilingual")

☁️ Verify on Google Colab

Want to test the CUDA Backend for free?

Open In Colab

  1. Open the notebook.
  2. Change Runtime type to T4 GPU.
  3. Run the cells to verify crayon_cuda compiles and smashes tokens at >100M/sec.

🧪 Testing & Verification

# Full verification (Benchmarks + Tests)
python verify_dat_engine.py

# Benchmark all backends
python benchmark_competitive.py

Example output:

============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 14,255,305 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY
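A throughput figure like the one above is tokens produced divided by wall-clock time. The sketch below shows one simple way to measure that for any tokenizer callable; it is not the code in `verify_dat_engine.py` or `benchmark_competitive.py`, and `measure_throughput` is a hypothetical helper. `str.split` is used as a stand-in tokenizer so the example runs without Crayon installed.

```python
import time


def measure_throughput(tokenize, texts, repeats=5):
    """Return the best observed tokens/sec over `repeats` passes."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(len(tokenize(text)) for text in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, total_tokens / elapsed)
    return best


# Stand-in tokenizer: whitespace splitting from the standard library.
sample_texts = ["def fast_inverse_sqrt(x): return x ** -0.5"] * 1000
rate = measure_throughput(str.split, sample_texts)
print(f"{rate:,.0f} tokens/sec")
```

Taking the best of several passes reduces noise from warm-up and background load; results still vary by machine, so comparisons are only meaningful when run on the same hardware.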

📜 Citation

@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}

📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.


Built with 💙 by Xerv Research Engineering Division

⭐ Star this repo if Crayon helps your project!
