The Omni-Backend Tokenizer (CPU/CUDA/ROCm)

🖍️ XERV Crayon v4.0

The Omni-Backend Tokenizer for Specialized AI

License: MIT · Python 3.12+ · CUDA · ROCm · AVX2

Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.


🚀 Key Features

| Feature | Description |
|---|---|
| 💾 Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual) |
| 🚀 Omni-Backend | Auto-detects and runs on CPU (AVX2), NVIDIA (CUDA), or AMD (ROCm) |
| ⚡ Native GPU Kernels | "Bare Metal" C++/HIP kernels (no wrappers) for >100M tokens/sec |
| 🗺️ Zero-Copy Mapping | DAT files loaded via mmap for instant startup and minimal RAM |
| 🌊 Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads |
| 🛡️ Offline Resilience | Seamless local bootstrap fallback; works offline out of the box |

📊 Benchmarks: The Numbers Speak

100% HONEST. NO SUGARCOATING. DATA-DRIVEN.

Run python benchmark_competitive.py to reproduce these results yourself.

⚡ Speed Comparison (Omni-Backend)

| Tokenizer | Tokens/sec | vs CRAYON |
|---|---|---|
| 🖍️ CRAYON (CPU - AVX2) | 21,863,777 | baseline |
| 🖍️ CRAYON (CUDA - A100) | 140,000,000+ | 6.4x faster |
| tiktoken (GPT-4) | 524,469 | 41x slower |
| HF LLaMA (SP-BPE) | 281,558 | 77x slower |
| HF GPT-2 (BPE) | 237,117 | 92x slower |
| HF BERT (WordPiece) | 202,269 | 108x slower |
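
The "vs CRAYON" column can be reproduced from the throughput figures in the table. The short sketch below assumes each "Nx slower" value is the CPU baseline divided by the competitor's throughput, truncated to a whole number, which is the reading that matches the published values. The variable names are illustrative.

```python
# Throughput figures copied from the speed comparison table (tokens per second).
CRAYON_CPU = 21_863_777
CRAYON_CUDA = 140_000_000  # the table lists this as a lower bound ("140,000,000+")

competitors = {
    "tiktoken (GPT-4)": 524_469,
    "HF LLaMA (SP-BPE)": 281_558,
    "HF GPT-2 (BPE)": 237_117,
    "HF BERT (WordPiece)": 202_269,
}

# Assumption: "Nx slower" is the CPU baseline divided by the competitor's
# throughput, truncated (not rounded) to a whole number.
slowdowns = {name: CRAYON_CPU // tps for name, tps in competitors.items()}

# The CUDA row is the inverse ratio, rounded to one decimal place.
cuda_speedup = round(CRAYON_CUDA / CRAYON_CPU, 1)

for name, factor in slowdowns.items():
    print(f"{name}: {factor}x slower")
print(f"CRAYON (CUDA - A100): {cuda_speedup}x faster")
```

Note that every ratio is relative to the CPU (AVX2) row, not the CUDA row.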

📈 CPU Optimization Verification

Measured on Intel Core i3-7020U (Low-Power Laptop CPU)

| Metric | Result |
|---|---|
| ✅ AVX2 Status | Active (Simd-Ops v4) |
| ✅ Load Time | 0.54ms (Instant hot-swap) |
| ✅ Throughput | 21.1M tokens/sec (!?!) |
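
For readers who want to sanity-check a tokens/sec figure on their own machine, here is a minimal timing sketch. It is not the project's benchmark_competitive.py: `measure_throughput` is a hypothetical helper, and `str.split` stands in for a real tokenizer.

```python
import time
from typing import Callable, Sequence


def measure_throughput(tokenize: Callable[[str], Sequence], text: str, repeats: int = 5) -> float:
    """Return the best observed tokens/sec over `repeats` timed runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        tokens = tokenize(text)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(tokens) / elapsed)
    return best


# Placeholder workload: whitespace splitting stands in for a real tokenizer.
sample = "the quick brown fox jumps over the lazy dog " * 20_000
rate = measure_throughput(str.split, sample)
print(f"{rate:,.0f} tokens/sec")
```

Taking the best of several runs reduces noise from caches and background processes; absolute numbers will still vary widely between machines.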

[Figure: Benchmark Comparison]


⚡ Quick Start: The "Omni-Backend"

Select a backend with a single line of code. Crayon automatically detects whether AVX2, CUDA, or ROCm is available.

1. Hardware-Aware Initialization

from crayon.core.vocabulary import CrayonVocab

# 🔵 CPU (Intel/AMD) - AVX2/AVX-512 Native
vocab = CrayonVocab(device="cpu")

# 🟢 NVIDIA GPUs (All Tensor Core Architectures)
vocab = CrayonVocab(device="cuda")

# 🔴 AMD GPUs (Instinct/Radeon HIP/ROCm)
vocab = CrayonVocab(device="rocm")

2. The "Context Manager" Hot-Swap

Instantly switch between specialized vocabularies within the same script without re-creating the tokenizer.

vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")

# ... standard tokenization ...

# ⚡ TEMPORARY SWITCH to 'code' profile for a function block
with vocab.using_profile("code"):
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
    # Uses the compact Code vocabulary here

# 🔥 AUTOMATICALLY REVERT to 'lite' here
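
The revert-on-exit behaviour shown above is what a Python context manager provides. Below is a minimal sketch of the pattern, using a hypothetical `ToyVocab` class rather than Crayon's actual implementation.

```python
from contextlib import contextmanager


class ToyVocab:
    """Hypothetical stand-in for a vocabulary object; not Crayon's API."""

    def __init__(self, profile: str = "lite") -> None:
        self.profile = profile

    def load_profile(self, name: str) -> None:
        # A real implementation would map a different vocabulary file here.
        self.profile = name

    @contextmanager
    def using_profile(self, name: str):
        previous = self.profile
        self.load_profile(name)
        try:
            yield self
        finally:
            # Restore the earlier profile even if the block raised.
            self.load_profile(previous)


vocab = ToyVocab("lite")
with vocab.using_profile("code"):
    inside = vocab.profile
after = vocab.profile
print(inside, after)
```

The `try`/`finally` is the important part: without it, an exception inside the `with` block would leave the temporary profile active.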

3. Basic Example

import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu # Auto-renamed from crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_cpu.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
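
The memory-mapping step is what keeps loading cheap: the operating system maps the file into the process address space and pages data in on demand, instead of copying the whole file into a Python object. The sketch below illustrates the idea with the standard library only. The file layout (a flat array of 32-bit integers) is invented for illustration and is not Crayon's DAT format.

```python
import mmap
import os
import struct
import tempfile

values = list(range(1000))

with tempfile.TemporaryDirectory() as tmp:
    # Write a toy binary file: 1,000 unsigned 32-bit integers, native byte order.
    path = os.path.join(tmp, "toy.bin")
    with open(path, "wb") as f:
        f.write(struct.pack(f"={len(values)}I", *values))

    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # memoryview.cast reinterprets the mapped bytes as integers without copying.
        raw = memoryview(mm)
        view = raw.cast("I")
        first, last, count = view[0], view[-1], len(view)
        # Release both views before the map closes; closing a map that still
        # has exported buffers raises BufferError.
        view.release()
        raw.release()

print(first, last, count)
```

Because the mapping is read-only and backed by the file, several processes mapping the same file can share the same physical pages.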

📦 Installation

git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .

Build the Extensions

PowerShell (Windows):

python setup.py build_ext --inplace

Bash (Linux/Mac):

python setup.py build_ext --inplace

Note: The setup script auto-detects nvcc and hipcc. If found, GPU backends are built automatically.
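
One common way for a build script to detect GPU toolchains is to look for the compilers on PATH. The sketch below shows that approach with `shutil.which`. It is an assumption about how such detection could work, not a copy of Crayon's setup.py, and `detect_gpu_toolchains` is a hypothetical helper.

```python
import shutil


def detect_gpu_toolchains() -> dict:
    """Map each GPU backend to its compiler path, or None if it is not on PATH."""
    return {
        "cuda": shutil.which("nvcc"),   # NVIDIA CUDA compiler
        "rocm": shutil.which("hipcc"),  # AMD HIP compiler
    }


found = detect_gpu_toolchains()
for backend, compiler in found.items():
    status = f"found at {compiler}" if compiler else "not found, backend would be skipped"
    print(f"{backend}: {status}")
```

Running this before the build is a quick way to check whether a GPU backend is likely to be compiled on a given machine.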


🏎️ Omni-Backend Architecture (v4.0)

Crayon v4.0 uses a multi-backend design: a vocabulary list is compiled once into a binary DAT file, which is then loaded by the CPU, CUDA, or HIP engine:

┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATBuilder   │ ──▶  │  vocab.dat  │ ──▶  │ Omni-Engine  │
│   (List)    │      │  (Python)    │      │  (Binary)   │      │ CPU/CUDA/HIP │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘

| Component | File | Accelerators |
|---|---|---|
| CPU Backend | c_ext/cpu_engine.cpp | AVX-512 / AVX2 (Intel/AMD) |
| CUDA Backend | c_ext/gpu_engine_cuda.cu | Tensor Cores (NVIDIA Tesla/Ampere) |
| ROCm Backend | c_ext/rocm_engine.cpp | CDNA2 / RDNA3 (AMD Instinct/Radeon) |
| Zero-Copy Loader | mmap + buffer protocol | Instant startup (0.5ms) |

🧩 Available Cartridges

5 production-ready profiles defined in src/crayon/core/profiles.py:

| Profile | Size | Optimized For | Sources |
|---|---|---|---|
| lite | 50k | Speed & Mobile | WikiText, RainDrop |
| science | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| code | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| multilingual | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| arts_commerce | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |
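
Load a cartridge by name on a CrayonVocab instance, as in the Quick Start:

vocab = CrayonVocab(device="cpu")
vocab.load_profile("science")
# Swap to another cartridge at any time
vocab.load_profile("multilingual")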
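
To make the cartridge list concrete, here is one way such a profile registry could be represented in Python. The `Profile` dataclass, the `PROFILES` mapping, and `get_profile` are hypothetical and only mirror the profile names, sizes, and sources listed above. The real definitions live in src/crayon/core/profiles.py and may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    vocab_size: int  # nominal size from the list above ("50k" -> 50_000)
    sources: tuple


# Hypothetical registry mirroring the five listed cartridges.
PROFILES = {
    p.name: p
    for p in (
        Profile("lite", 50_000, ("WikiText", "RainDrop")),
        Profile("science", 250_000, ("GRAD", "Physics-700")),
        Profile("code", 250_000, ("CodeParrot", "The Stack")),
        Profile("multilingual", 250_000, ("OSCAR", "Wikipedia")),
        Profile("arts_commerce", 250_000, ("PG19", "Fin Phrasebank")),
    )
}


def get_profile(name: str) -> Profile:
    """Look up a profile by name, listing the valid names on failure."""
    try:
        return PROFILES[name]
    except KeyError:
        raise ValueError(f"unknown profile {name!r}; choose from {sorted(PROFILES)}") from None


print(get_profile("science").vocab_size)
```

Failing loudly on an unknown name, with the valid options in the message, tends to save time when a profile name is mistyped.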

☁️ Verify on Google Colab

Want to test the CUDA backend for free?

  1. Open the Colab notebook.
  2. Change the runtime type to T4 GPU.
  3. Run the cells to verify that crayon_cuda compiles and tokenizes at >100M tokens/sec.

🧪 Testing & Verification

# Full verification (Benchmarks + Tests)
python verify_dat_engine.py

# Benchmark all backends
python benchmark_competitive.py

Example output:

============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 14,255,305 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY

📜 Citation

@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}

📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.


Built with 💙 by Xerv Research Engineering Division

⭐ Star this repo if Crayon helps your project!
