
Production-grade tokenizer achieving >16M tokens/s via AVX2/SIMD optimizations and Double-Array Trie engine.


🖍️ XERV Crayon

The Cartridge-Based Tokenizer for Specialized AI

License: MIT | Python 3.12+ | AVX2

Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain: Quantum Physics, Rust Programming, Financial Law, or anything in between.


🚀 Key Features

| Feature | Description |
|---|---|
| 💾 Cartridge System | Instantly hot-swap specialized vocabularies (science, code, multilingual) |
| ⚡ AVX2 Double-Array Trie | Validated ~10M tokens/sec via SIMD-accelerated branchless tokenization |
| 🗺️ Zero-Copy Memory Mapping | DAT files loaded via mmap for instant startup & minimal RAM |
| 🌊 Zero-Disk Streaming | Build profiles directly from Hugging Face, with no multi-GB downloads |
| 🛡️ Offline Resilience | Seamless local bootstrap fallback. Works offline out-of-the-box |
| 🧠 Entropy-Guided Construction | Information-theoretic token selection for maximum domain efficiency |

📊 Benchmarks: The Numbers Speak

100% HONEST. NO SUGARCOATING. DATA-DRIVEN.

Run python benchmark_competitive.py to reproduce these results yourself.

⚡ Speed Comparison

| Tokenizer | Tokens/sec | vs CRAYON |
|---|---|---|
| 🖍️ CRAYON (lite, 50k) | 6,010,525 | baseline |
| tiktoken (GPT-4) | 524,469 | 11.5x slower |
| tiktoken (GPT-3) | 466,823 | 12.9x slower |
| HF LLaMA (SP-BPE) | 281,558 | 21.3x slower |
| HF GPT-2 (BPE) | 237,117 | 25.3x slower |
| HF BERT (WordPiece) | 202,269 | 29.7x slower |
| HF T5 (SentencePiece) | 189,928 | 31.6x slower |
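
The "vs CRAYON" column is the ratio of the CRAYON baseline to each tokenizer's throughput. The short sketch below recomputes it from the table's own figures; the dictionary and the `slowdown` helper are illustrative names, not part of the Crayon package.

```python
# Throughput figures copied from the speed comparison table (tokens/sec).
throughput = {
    "CRAYON (lite, 50k)": 6_010_525,
    "tiktoken (GPT-4)": 524_469,
    "tiktoken (GPT-3)": 466_823,
    "HF LLaMA (SP-BPE)": 281_558,
    "HF GPT-2 (BPE)": 237_117,
    "HF BERT (WordPiece)": 202_269,
    "HF T5 (SentencePiece)": 189_928,
}

baseline = throughput["CRAYON (lite, 50k)"]


def slowdown(name: str) -> float:
    """How many times slower `name` is than the CRAYON baseline."""
    return baseline / throughput[name]


for name in throughput:
    if name != "CRAYON (lite, 50k)":
        print(f"{name}: {slowdown(name):.1f}x slower")
```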

📈 Full Benchmark Results

| Tokenizer | Vocab Size | Tokens/sec | MB/sec | Load Time | Avg Time |
|---|---|---|---|---|---|
| CRAYON (lite, 50k) | 50,000 | 6,010,525 | 15.33 | 0.54ms | 4.56ms |
| tiktoken (cl100k/GPT-4) | 100,000 | 524,469 | 2.18 | 0.01ms | 32.03ms |
| tiktoken (p50k/GPT-3) | 50,000 | 466,823 | 1.55 | 0.00ms | 44.98ms |
| HF LLaMA (SP-BPE) | 32,000 | 281,558 | 0.95 | 1212.02ms | 73.52ms |
| HF GPT-2 (BPE) | 50,257 | 237,117 | 0.69 | 2051.18ms | 100.79ms |
| HF BERT (WordPiece) | 30,522 | 202,269 | 0.73 | 1603.10ms | 95.43ms |
| HF T5 (SentencePiece) | 32,000 | 189,928 | 0.68 | 1727.91ms | 102.15ms |

📋 Test Environment & Methodology
  • Platform: Windows AMD64, Python 3.13.1
  • Test Text: 68.4 KB mixed content (code, prose, multilingual)
  • Iterations: 10 runs + 2 warmup per tokenizer
  • Full methodology: BENCHMARK_RESULTS.md
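
The "10 runs + 2 warmup" procedure can be sketched roughly as follows. This is not the code in benchmark_competitive.py, only a minimal stand-in: the `benchmark` function, its parameter names, and the decimal-megabyte convention are assumptions, and `str.split` replaces a real tokenizer so the sketch runs without Crayon installed.

```python
import time
from statistics import mean
from typing import Callable, List


def benchmark(tokenize: Callable[[str], List], text: str,
              runs: int = 10, warmup: int = 2) -> dict:
    """Time `tokenize` on `text`: `warmup` untimed calls, then `runs` timed calls."""
    for _ in range(warmup):
        tokenize(text)  # warm caches; results are discarded

    durations = []
    n_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = tokenize(text)
        durations.append(time.perf_counter() - start)
        n_tokens = len(tokens)

    avg = mean(durations)
    n_bytes = len(text.encode("utf-8"))
    return {
        "tokens_per_sec": n_tokens / avg,
        "mb_per_sec": n_bytes / avg / 1e6,  # assumption: 1 MB = 10^6 bytes
        "avg_time_ms": avg * 1e3,
    }


# Stand-in tokenizer (whitespace split), purely for illustration.
sample = 'fn main() { println!("Hello, World!"); }\n' * 1000
result = benchmark(str.split, sample)
print(result)
```

Swapping `str.split` for any callable that maps a string to a list of tokens gives comparable figures for that tokenizer on the same text.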

🏆 Key Takeaways

| Metric | Result |
|---|---|
| ✅ vs tiktoken (GPT-4) | 11.5x faster |
| ✅ vs HuggingFace GPT-2 | 25x faster |
| ✅ Load time | 0.54ms (vs 1-2s for HuggingFace) |
| ✅ Peak throughput | 10.4M tokens/sec (science profile) |



⚡ Quick Start

Get tokenizing in under 60 seconds:

Option 1: Direct DAT Compilation

import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_fast.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_fast.tokenize(code)
print(f"Tokens: {tokens}")

Option 2: Profile System (Recommended)

from crayon.core.vocabulary import CrayonVocab

# Load pre-compiled profile (requires one-time compile_profiles.py)
vocab = CrayonVocab.load_profile("code")
tokens = vocab.tokenize("fn main() { }")
decoded = vocab.decode(tokens)
print(f"Decoded: {decoded}")
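
Because `load_profile` depends on the one-time profile compilation, a script may want a guard for machines where Crayon or its profiles are not yet set up. The sketch below is one possible pattern, not part of the package: `get_tokenizer` is a hypothetical helper, and the whitespace-split fallback is only a placeholder with a different output type (strings rather than token IDs).

```python
from typing import Callable, List


def get_tokenizer(profile: str = "lite") -> Callable[[str], List]:
    """Return a tokenize callable, preferring a Crayon profile when available."""
    try:
        from crayon.core.vocabulary import CrayonVocab
        return CrayonVocab.load_profile(profile).tokenize
    except Exception as exc:  # package not installed, profile not compiled, ...
        print(f"Crayon unavailable ({exc!r}); using whitespace split instead")
        return str.split  # placeholder fallback, an assumption of this sketch


tokenize = get_tokenizer("code")
print(tokenize("fn main() { }"))
```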

📦 Installation

git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .

Build the AVX2 Extension

python setup.py build_ext --inplace

Note: Requires a C++ compiler (MSVC on Windows, GCC/Clang on Linux/Mac).

🔧 One-Time Setup: Compile Profiles

# Builds .dat files → ~/.cache/xerv/crayon/profiles/
python compile_profiles.py

Each profile takes 38ms-26s depending on size. See DAT_BUILDING_EXPLAINED.md for details.

🧪 Verify Installation

python demo_tokenize.py

Expected output:

[1] Loading 'lite' profile...
    Status: 🚀 Fast C++ DAT Engine
[2] Tokenizing: 'Hello, world! This is Crayon.'
    Tokens IDs: [...]

🏎️ DAT Engine V2 Architecture

Crayon V2 combines an offline compiler, an AVX2 runtime, and a zero-copy loader:

┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATBuilder   │ ──▶  │  vocab.dat  │ ──▶  │  C++ Engine  │
│   (List)    │      │  (Python)    │      │  (Binary)   │      │   (AVX2)     │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘

| Component | File | Purpose |
|---|---|---|
| Offline Compiler | dat_builder.py | First-Fit algorithm → compact DAT binary |
| AVX2 Runtime | engine.cpp | Branchless state transitions + SIMD parallel ASCII |
| Zero-Copy Loader | mmap + buffer protocol | Instant startup, minimal RAM |
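
To make the pipeline concrete, here is a minimal pure-Python sketch of a double-array trie: a first-fit packer and a greedy longest-match tokenizer. It illustrates the general technique only. It is not the code in dat_builder.py or engine.cpp, and the `build_dat` and `tokenize` names, the three-array layout (`base`, `check`, `value`), and the use of -1 for unknown bytes are all assumptions made for this example.

```python
from collections import deque
from typing import List, Tuple


def build_dat(vocab: List[str]) -> Tuple[List[int], List[int], List[int]]:
    """Compile `vocab` into (base, check, value) arrays; token id = list index."""
    # Step 1: ordinary pointer-based trie over UTF-8 bytes.
    children = [{}]   # children[node] maps byte -> child node
    terminal = [-1]   # terminal[node] is a token id, or -1
    for token_id, token in enumerate(vocab):
        node = 0
        for byte in token.encode("utf-8"):
            if byte not in children[node]:
                children[node][byte] = len(children)
                children.append({})
                terminal.append(-1)
            node = children[node][byte]
        terminal[node] = token_id

    # Step 2: pack the trie into flat arrays. check[t] == s marks slot t as
    # a child of state s; -1 marks a free slot. The root is state 0.
    base, check, value = [0], [-1], [-1]
    queue = deque([(0, 0)])  # (trie node, DAT state)
    while queue:
        node, state = queue.popleft()
        kids = children[node]
        if not kids:
            continue
        # First-fit: smallest offset >= 1 whose target slots are all free.
        offset = 1
        while any(offset + b < len(check) and check[offset + b] != -1
                  for b in kids):
            offset += 1
        needed = offset + max(kids) + 1
        if needed > len(check):
            grow = needed - len(check)
            base += [0] * grow
            check += [-1] * grow
            value += [-1] * grow
        base[state] = offset
        for byte, child in kids.items():
            slot = offset + byte
            check[slot] = state
            value[slot] = terminal[child]
            queue.append((child, slot))
    return base, check, value


def tokenize(text: str, base: List[int], check: List[int],
             value: List[int], unk: int = -1) -> List[int]:
    """Greedy longest-match tokenization; unknown bytes become `unk`."""
    data = text.encode("utf-8")
    out, i = [], 0
    while i < len(data):
        state, best_id, best_len = 0, unk, 1
        j = i
        while j < len(data):
            t = base[state] + data[j]      # one add and one compare per byte
            if t >= len(check) or check[t] != state:
                break
            state, j = t, j + 1
            if value[state] != -1:
                best_id, best_len = value[state], j - i
        out.append(best_id)
        i += best_len
    return out


vocab = ["fn", "main", "(", ")", "{", "}", " ", "ma"]
base, check, value = build_dat(vocab)
ids = tokenize("fn main", base, check, value)
print(ids)  # [0, 6, 1]
```

Each transition is a single array index plus one comparison, which is why the packed arrays can be written to disk as a flat binary and consumed by a compiled engine without rebuilding any pointer structure.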

🧩 Available Cartridges

5 production-ready profiles defined in src/crayon/core/profiles.py:

| Profile | Size | Optimized For | Sources |
|---|---|---|---|
| lite | 50k | Speed & Mobile | WikiText, RainDrop |
| science | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| code | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| multilingual | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| arts_commerce | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |

from crayon.core.vocabulary import CrayonVocab

vocab = CrayonVocab.load_profile("science")
vocab = CrayonVocab.load_profile("multilingual")

🛠️ Advanced Usage

Compile Vocabulary to DAT Format

from crayon.c_ext.dat_builder import DATBuilder
import json

with open("trained_vocab_lite.json", "r") as f:
    vocab = json.load(f)

builder = DATBuilder()
builder.build(vocab)
builder.save("vocab_lite.dat")

Direct C++ Engine Access

import mmap
from crayon.c_ext import crayon_fast

with open("vocab_lite.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_fast.load_dat(mm)

tokens = crayon_fast.tokenize("Your text here")
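
The zero-copy idea behind passing an `mmap` object to the engine can be seen with the standard library alone. The sketch below is an illustration under assumptions: the file name, the int32 layout, and the variable names are hypothetical and unrelated to Crayon's actual .dat format. It writes a small integer array to disk, maps it read-only, and views it through the buffer protocol without copying.

```python
import mmap
import os
import struct
import tempfile

values = [7, 11, 13, 17]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.dat")  # hypothetical stand-in for a .dat file

    # Write the array as native-endian 32-bit integers.
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}i", *values))

    # Map the file read-only; the OS pages data in lazily on first access.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # memoryview exposes the mapping via the buffer protocol without copying.
    base_view = memoryview(mm)
    int_view = base_view.cast("i")
    loaded = list(int_view)

    # Views must be released before the mapping can be closed.
    int_view.release()
    base_view.release()
    mm.close()

print(loaded)  # [7, 11, 13, 17]
```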

Force Rebuild / Offline Mode

from crayon.core.vocabulary import CrayonVocab

# Rebuild from local resources only (fastest)
vocab = CrayonVocab.load_profile("arts_commerce", force_rebuild=True)

🏗️ Architecture

| Layer | File | Purpose |
|---|---|---|
| Builder | c_ext/dat_builder.py | Offline DAT compiler |
| Engine | c_ext/engine.cpp | AVX2 SIMD runtime |
| Config | core/profiles.py | Cartridge definitions |
| Resources | resources.py | Streaming, fallbacks, caching |

For a deep dive, read the Engineering Treatise.
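
As a rough illustration of what testing many bytes for ASCII at once can look like, the sketch below checks 8 bytes per step by masking the top bit of each byte in a 64-bit word. This is a plain-Python analogy and an assumption about the general idea, not the code in engine.cpp; an AVX2 implementation would typically test 32 bytes at a time with an intrinsic such as `_mm256_movemask_epi8`. The `is_ascii_swar` name is hypothetical.

```python
HIGH_BITS = 0x8080808080808080  # top bit of each of 8 packed bytes


def is_ascii_swar(data: bytes) -> bool:
    """Check for pure ASCII, testing 8 bytes per step instead of 1."""
    n = len(data) - len(data) % 8
    for i in range(0, n, 8):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:  # any byte >= 0x80 sets one of these bits
            return False
    # Tail: fewer than 8 bytes remain, so check them one at a time.
    return all(b < 0x80 for b in data[n:])


print(is_ascii_swar(b"fn main() { }"))                 # True
print(is_ascii_swar("héllo wörld!".encode("utf-8")))   # False
```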


🧪 Testing

# All tests
python -m pytest tests/ -v

# DAT engine tests
python -m pytest tests/test_c_ext.py -v

14/14 tests pass: DATBuilder, C++ module, full pipeline, Python fallback.

🔬 DAT Engine Verification

python verify_dat_engine.py

Expected output:

============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 9,786,707 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY

📊 Training Data

| Dataset | Size | Samples | Domain |
|---|---|---|---|
| Tiny Shakespeare | 1.06 MB | 1 (Full) | Classical Literature |
| RainDrop-DTS | 179 KB | 3,210 | Instruction Following |
| Physics | 332 KB | 700 | Scientific Reasoning |
| GRAD Math | 5.00 MB | 500* | Graduate Mathematics |
| TOTAL | ~6.56 MB | 4,411 | Curated Corpus |

*GRAD dataset limited to 500 high-density samples for efficient default build.
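
The TOTAL row can be checked from the individual rows. The sketch below assumes binary units (1 MB = 1024 KB), which is the convention that reproduces the ~6.56 MB figure; the dictionary layout is illustrative only.

```python
KB = 1024
MB = 1024 * KB

# (size in bytes, sample count), copied from the training data table.
datasets = {
    "Tiny Shakespeare": (1.06 * MB, 1),
    "RainDrop-DTS": (179 * KB, 3_210),
    "Physics": (332 * KB, 700),
    "GRAD Math": (5.00 * MB, 500),
}

total_mb = sum(size for size, _ in datasets.values()) / MB
total_samples = sum(count for _, count in datasets.values())
print(f"{total_mb:.2f} MB, {total_samples} samples")  # 6.56 MB, 4411 samples
```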


🧩 API Reference

CrayonVocab

# Constructors
CrayonVocab(tokens: List[str], unk_token: str = "<UNK>")
CrayonVocab.from_corpus(corpus: str, target_size: int = 500000)
CrayonVocab.from_default_sources(vocab_size: int = 500000)
CrayonVocab.from_file(path: str)
CrayonVocab.from_json(path: str)
CrayonVocab.load_profile(name: str)  # Load cached DAT profiles

# Methods
vocab.tokenize(text: str) -> List[int]
vocab.decode(token_ids: List[int]) -> str
vocab.save(path: str, format: str = "txt")

DAT Builder

from crayon.c_ext.dat_builder import DATBuilder

builder = DATBuilder()
builder.build(vocab_list: List[str])
builder.save(output_path: str)

C++ Engine

from crayon.c_ext import crayon_fast

crayon_fast.load_dat(buffer)  # bytes, mmap, or memoryview
crayon_fast.tokenize(text: str) -> List[int]

Utilities

from crayon import check_c_extension, check_resources

print(check_c_extension())  # True/False
print(check_resources())     # Available data sources

🤝 Contributing

We welcome contributions! Whether it's new cartridges, performance optimizations, or bug fixes, open an issue or submit a PR.


📜 Citation

@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}

📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the MIT License.


Built with 💙 by Xerv Research Engineering Division

⭐ Star this repo if Crayon helps your project!
