The Omni-Backend Tokenizer (CPU/CUDA/ROCm)

๐Ÿ–๏ธ XERV Crayon v4.0

The Omni-Backend Tokenizer for Specialized AI

License: MIT · Python 3.12+ · CUDA · ROCm · AVX2

Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.


🚀 Key Features

Feature | Description
--- | ---
💾 Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual)
🚀 Omni-Backend | Auto-detects & runs on CPU (AVX2), NVIDIA (CUDA), or AMD (ROCm)
⚡ Native GPU Kernels | "Bare Metal" C++/HIP kernels (no wrappers) for >100M tokens/sec
🗺️ Zero-Copy Mapping | DAT files loaded via mmap for instant startup & minimal RAM
🌊 Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads
🛡️ Offline Resilience | Seamless local bootstrap fallback. Works offline out-of-the-box

📊 Benchmarks: The Numbers Speak

100% HONEST. NO SUGARCOATING. DATA-DRIVEN.

Run python benchmark_competitive.py to reproduce these results yourself.

⚡ Speed Comparison (Omni-Backend)

Tokenizer | Tokens/sec | vs CRAYON
--- | --- | ---
🖍️ CRAYON (CPU - AVX2) | 21,863,777 | baseline
🖍️ CRAYON (CUDA - A100) | 140,000,000+ | 6.4x faster
tiktoken (GPT-4) | 524,469 | 41x slower
HF LLaMA (SP-BPE) | 281,558 | 77x slower
HF GPT-2 (BPE) | 237,117 | 92x slower
HF BERT (WordPiece) | 202,269 | 108x slower
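
The "vs CRAYON" column can be re-derived from the throughput figures. The short sketch below recomputes each ratio against the CPU baseline. The dictionary only restates the table, and the helper name `relative_speed` is illustrative. Judging by the results, the table truncates the slowdown factors rather than rounding them.

```python
# Throughput figures copied from the table above (tokens/sec).
BASELINE = 21_863_777  # CRAYON (CPU - AVX2)

others = {
    "CRAYON (CUDA - A100)": 140_000_000,
    "tiktoken (GPT-4)": 524_469,
    "HF LLaMA (SP-BPE)": 281_558,
    "HF GPT-2 (BPE)": 237_117,
    "HF BERT (WordPiece)": 202_269,
}

def relative_speed(tokens_per_sec: int, baseline: int = BASELINE) -> str:
    """Describe a throughput relative to the baseline."""
    if tokens_per_sec >= baseline:
        return f"{tokens_per_sec / baseline:.1f}x faster"
    return f"{baseline / tokens_per_sec:.1f}x slower"

ratios = {name: relative_speed(tps) for name, tps in others.items()}
for name, text in ratios.items():
    print(f"{name:<24} {text}")
```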

📈 CPU Optimization Verification

Measured on Intel Core i3-7020U (Low-Power Laptop CPU)

Metric | Result
--- | ---
✅ AVX2 Status | Active (Simd-Ops v4)
✅ Load Time | 0.54ms (Instant hot-swap)
✅ Throughput | 21.1M tokens/sec (!?!)
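
To sanity-check a throughput figure on your own machine without the full benchmark script, a timing loop along these lines may help. It is a sketch only: `measure_throughput` is a hypothetical helper, not part of Crayon, and `str.split` stands in for a real tokenizer so that the block runs without the package installed.

```python
import time
from typing import Callable, Sequence

def measure_throughput(
    tokenize: Callable[[str], Sequence],
    text: str,
    repeats: int = 50,
) -> float:
    """Return tokens processed per second, averaged over `repeats` calls."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        total_tokens += len(tokenize(text))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return total_tokens / max(elapsed, 1e-9)

sample = "def fast_inverse_sqrt(x): return x ** -0.5 " * 1_000
rate = measure_throughput(str.split, sample)
print(f"{rate:,.0f} tokens/sec (whitespace split, for illustration only)")
```

Passing `vocab.tokenize` in place of `str.split` should give a rough comparison, though benchmark_competitive.py remains the reference for the numbers quoted here.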


⚡ Quick Start: The "Omni-Backend"

Run on any hardware with a single line of code. Crayon automatically detects AVX2, CUDA, or ROCm presence.

1. Hardware-Aware Initialization

from crayon.core.vocabulary import CrayonVocab

# 🔵 CPU (Intel/AMD) - AVX2/AVX-512 Native
vocab = CrayonVocab(device="cpu")

# 🟢 NVIDIA GPUs (All Tensor Core Architectures)
vocab = CrayonVocab(device="cuda")

# 🔴 AMD GPUs (Instinct/Radeon HIP/ROCm)
vocab = CrayonVocab(device="rocm")

2. The "Context Manager" Hot-Swap

Instantly switch between specialized vocabularies within the same script without reloading the model.

vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")

# ... standard tokenization ...

# ⚡ TEMPORARY SWITCH to 'code' profile for a function block
with vocab.using_profile("code"):
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
    # Uses the compact Code vocabulary here

# 🔥 AUTOMATICALLY REVERT to 'lite' here
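
For readers curious how a revert-on-exit hot-swap like this can work, here is a minimal sketch built on `contextlib.contextmanager`. `ToyVocab` is a stand-in written for this example and is not Crayon's `CrayonVocab`. The real class may track and restore profiles differently.

```python
from contextlib import contextmanager

class ToyVocab:
    """Stand-in for a tokenizer with swappable profiles (not Crayon's class)."""

    def __init__(self) -> None:
        self.active_profile = None

    def load_profile(self, name: str) -> None:
        # A real implementation would map the profile's data file here.
        self.active_profile = name

    @contextmanager
    def using_profile(self, name: str):
        previous = self.active_profile
        self.load_profile(name)
        try:
            yield self
        finally:
            # Restore the earlier profile even if the block raised.
            if previous is not None:
                self.load_profile(previous)
            else:
                self.active_profile = None

vocab = ToyVocab()
vocab.load_profile("lite")
with vocab.using_profile("code"):
    inside = vocab.active_profile
after = vocab.active_profile
print(inside, after)  # code lite
```

The `try`/`finally` is the important part: the previous profile comes back even when the body of the `with` block raises.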

3. Basic Example

import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu # Auto-renamed from crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_cpu.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
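
This README does not spell out how the engine matches text against the compiled vocabulary. As a rough mental model only, the sketch below assumes a trie with greedy longest-match lookup, using a plain nested dict instead of a compiled binary file. `build_trie`, `tokenize`, and the `unk_id` of -1 are all hypothetical and do not reflect `crayon_cpu`'s real behavior.

```python
from typing import List

def build_trie(vocab: List[str]) -> dict:
    """Build a nested-dict trie; the key None stores the token id."""
    root: dict = {}
    for token_id, token in enumerate(vocab):
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = token_id
    return root

def tokenize(text: str, trie: dict, unk_id: int = -1) -> List[int]:
    """Greedy longest-match: at each position take the longest known token."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        node, best_id, best_len = trie, unk_id, 1
        for offset in range(pos, len(text)):
            node = node.get(text[offset])
            if node is None:
                break
            if None in node:
                best_id, best_len = node[None], offset - pos + 1
        ids.append(best_id)
        pos += best_len
    return ids

vocab_list = ["f", "n", "fn", " ", "main", "(", ")"]
trie = build_trie(vocab_list)
print(tokenize("fn main()", trie))  # [2, 3, 4, 5, 6]
```

Note that "fn" wins over "f" followed by "n" because the longer match is preferred. Characters with no match fall back to `unk_id` and advance by one position.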

📦 Installation

git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .

Build the Extensions

The same command works in PowerShell (Windows) and Bash (Linux/macOS):

python setup.py build_ext --inplace

Note: The setup script auto-detects nvcc and hipcc. If found, GPU backends are built automatically.
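
To preview which GPU toolchains are visible before building, one plausible check is to look the compilers up on PATH. This is a sketch under that assumption: `detect_backends` is a hypothetical helper, and setup.py may use a different detection mechanism.

```python
import shutil
from typing import Callable, List, Optional

def detect_backends(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """List the backends a build could target, based on compilers found on PATH."""
    backends = ["cpu"]  # Assumed here to be built regardless of GPU toolchains.
    if which("nvcc"):   # NVIDIA CUDA compiler
        backends.append("cuda")
    if which("hipcc"):  # AMD HIP compiler (ROCm)
        backends.append("rocm")
    return backends

print("Backends likely to be built:", detect_backends())
```

The `which` parameter exists only so the lookup can be swapped out in tests. Calling `detect_backends()` with no arguments uses the real PATH.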


๐ŸŽ๏ธ Omni-Backend Architecture (v4.0)

Crayon now uses a "God Tier" multi-backend implementation combining:

┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATBuilder   │ ──▶  │  vocab.dat  │ ──▶  │ Omni-Engine  │
│   (List)    │      │  (Python)    │      │  (Binary)   │      │ CPU/CUDA/HIP │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘

Component | File | Accelerators
--- | --- | ---
CPU Backend | c_ext/cpu_engine.cpp | AVX-512 / AVX2 (Intel/AMD)
CUDA Backend | c_ext/gpu_engine_cuda.cu | Tensor Cores (NVIDIA Tesla/Ampere)
ROCm Backend | c_ext/rocm_engine.cpp | CDNA2 / RDNA3 (AMD Instinct/Radeon)
Zero-Copy Loader | mmap + buffer protocol | Instant startup (0.5ms)
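
The "mmap + buffer protocol" idea can be tried in isolation with the standard library. The sketch below writes a tiny binary file, maps it read-only, and reads integers through a `memoryview` without copying the mapped bytes into a new buffer first. The file name `toy.dat` and its layout (a flat array of native ints) are invented for illustration and do not reflect Crayon's actual DAT format.

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file of native-endian C ints (a stand-in for a .dat file).
values = [7, 11, 13, 17]
path = os.path.join(tempfile.mkdtemp(), "toy.dat")
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(values)}i", *values))

# Map the file read-only; the OS pages data in on demand.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# memoryview exposes the mapping through the buffer protocol without copying.
base = memoryview(mm)
view = base.cast("i")
loaded = list(view)
print(loaded)  # [7, 11, 13, 17]

# Release the views before closing the mapping; close() raises BufferError
# while exported buffers are still alive.
view.release()
base.release()
mm.close()
os.remove(path)
```

Because nothing is parsed or copied up front, startup cost stays roughly constant regardless of file size, which is consistent with the sub-millisecond load times quoted in this README.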

🧩 Available Cartridges

5 production-ready profiles defined in src/crayon/core/profiles.py:

Profile | Size | Optimized For | Sources
--- | --- | --- | ---
lite | 50k | Speed & Mobile | WikiText, RainDrop
science | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700
code | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack
multilingual | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia
arts_commerce | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank

# load_profile is called on an instance, as in the Quick Start
vocab = CrayonVocab(device="cpu")
vocab.load_profile("science")
vocab.load_profile("multilingual")

โ˜๏ธ Verify on Google Colab

Want to test the CUDA Backend for free?

  1. Open the notebook.
  2. Change Runtime type to T4 GPU.
  3. Run the cells to verify that crayon_cuda compiles and tokenizes at >100M tokens/sec.

🧪 Testing & Verification

# Full verification (Benchmarks + Tests)
python verify_dat_engine.py

# Benchmark all backends
python benchmark_competitive.py

Sample output:

============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 14,255,305 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY

📜 Citation

@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}

📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.


Built with 💙 by Xerv Research Engineering Division

โญ Star this repo if Crayon helps your project!
