Production-grade tokenizer achieving >16M tokens/s via AVX2/SIMD optimizations and Double-Array Trie engine.

🖍️ XERV Crayon

The Cartridge-Based Tokenizer for Specialized AI

Why force a single bloated vocabulary on every problem?
Crayon is a next-generation tokenizer designed for specialization. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain—Quantum Physics, Rust Programming, Financial Law, or anything in between.

🚀 Key Features

Feature	Description
💾 Cartridge System	Instantly hot-swap specialized vocabularies (`science`, `code`, `multilingual`)
⚡ AVX2 Double-Array Trie	Validated ~10M tokens/sec via SIMD-accelerated branchless tokenization
🗺️ Zero-Copy Memory Mapping	DAT files loaded via `mmap` for instant startup & minimal RAM
🌊 Zero-Disk Streaming	Build profiles directly from Hugging Face—no multi-GB downloads
🛡️ Offline Resilience	Seamless local bootstrap fallback. Works offline out-of-the-box
🧠 Entropy-Guided Construction	Information-theoretic token selection for maximum domain efficiency

📊 Benchmarks — The Numbers Speak

100% HONEST. NO SUGARCOATING. DATA-DRIVEN.

Run `python benchmark_competitive.py` to reproduce these results yourself.

⚡ Speed Comparison

Tokenizer	Tokens/sec	vs CRAYON
🖍️ CRAYON (lite, 50k)	6,010,525	baseline
tiktoken (GPT-4)	524,469	11.5x slower
tiktoken (GPT-3)	466,823	12.9x slower
HF LLaMA (SP-BPE)	281,558	21.3x slower
HF GPT-2 (BPE)	237,117	25.3x slower
HF BERT (WordPiece)	202,269	29.7x slower
HF T5 (SentencePiece)	189,928	31.6x slower
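
The "vs CRAYON" column can be re-derived from the Tokens/sec column. The short check below divides the CRAYON figure by each competitor's figure and rounds to one decimal. The helper name `slowdown_vs` is illustrative and not part of Crayon.

```python
# Tokens/sec figures copied from the speed comparison table.
THROUGHPUT = {
    "CRAYON (lite, 50k)": 6_010_525,
    "tiktoken (GPT-4)": 524_469,
    "tiktoken (GPT-3)": 466_823,
    "HF LLaMA (SP-BPE)": 281_558,
    "HF GPT-2 (BPE)": 237_117,
    "HF BERT (WordPiece)": 202_269,
    "HF T5 (SentencePiece)": 189_928,
}


def slowdown_vs(baseline: str, table: dict[str, int]) -> dict[str, float]:
    """Return baseline throughput divided by each other entry, to one decimal."""
    base = table[baseline]
    return {
        name: round(base / rate, 1)
        for name, rate in table.items()
        if name != baseline
    }


ratios = slowdown_vs("CRAYON (lite, 50k)", THROUGHPUT)
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x slower")
```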

📈 Full Benchmark Results

Tokenizer	Vocab Size	Tokens/sec	MB/sec	Load Time	Avg Time
CRAYON (lite, 50k)	50,000	6,010,525	15.33	0.54ms	4.56ms
tiktoken (cl100k/GPT-4)	100,000	524,469	2.18	0.01ms	32.03ms
tiktoken (p50k/GPT-3)	50,000	466,823	1.55	0.00ms	44.98ms
HF LLaMA (SP-BPE)	32,000	281,558	0.95	1212.02ms	73.52ms
HF GPT-2 (BPE)	50,257	237,117	0.69	2051.18ms	100.79ms
HF BERT (WordPiece)	30,522	202,269	0.73	1603.10ms	95.43ms
HF T5 (SentencePiece)	32,000	189,928	0.68	1727.91ms	102.15ms

📋 Test Environment & Methodology

Platform: Windows AMD64, Python 3.13.1
Test Text: 68.4 KB mixed content (code, prose, multilingual)
Iterations: 10 runs + 2 warmup per tokenizer
Full methodology: BENCHMARK_RESULTS.md
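
The methodology above (2 warmup runs, then 10 timed runs) can be sketched as a small harness. This is not the code in `benchmark_competitive.py`; the function name `measure_throughput`, the use of the mean over runs, and the whitespace-split stand-in tokenizer are all assumptions made so the sketch runs without Crayon installed.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_throughput(
    tokenize: Callable[[str], Sequence],
    text: str,
    runs: int = 10,
    warmup: int = 2,
) -> dict:
    """Time tokenize(text) and report mean latency and throughput."""
    for _ in range(warmup):  # warmup runs are executed but not timed
        tokenize(text)

    timings = []
    n_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = tokenize(text)
        timings.append(time.perf_counter() - start)
        n_tokens = len(tokens)

    avg_s = statistics.mean(timings)
    size_mb = len(text.encode("utf-8")) / (1024 * 1024)
    return {
        "runs": runs,
        "tokens": n_tokens,
        "avg_ms": avg_s * 1000,
        "tokens_per_sec": n_tokens / avg_s,
        "mb_per_sec": size_mb / avg_s,
    }


# Stand-in tokenizer (whitespace split); swap in a real tokenizer's callable.
sample = 'fn main() { println!("Hello, World!"); } ' * 2000
report = measure_throughput(str.split, sample)
print(report)
```

To benchmark a real tokenizer, pass its tokenize callable in place of `str.split` and use the same input text for every tokenizer being compared.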

🏆 Key Takeaways

Metric	Result
✅ vs tiktoken (GPT-4)	11.5x faster
✅ vs HuggingFace GPT-2	25x faster
✅ Load time	0.54ms (vs 1-2s for HuggingFace)
✅ Peak throughput	10.4M tokens/sec (science profile)

⚡ Quick Start

Get tokenizing in under 60 seconds:

Option 1: Direct DAT Compilation

import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_fast.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_fast.tokenize(code)
print(f"Tokens: {tokens}")

Option 2: Profile System (Recommended)

from crayon.core.vocabulary import CrayonVocab

# Load pre-compiled profile (requires one-time compile_profiles.py)
vocab = CrayonVocab.load_profile("code")
tokens = vocab.tokenize("fn main() { }")
decoded = vocab.decode(tokens)
print(f"Decoded: {decoded}")

📦 Installation

git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .

Build the AVX2 Extension

python setup.py build_ext --inplace

Note: Requires a C++ compiler (MSVC on Windows, GCC/Clang on Linux/macOS).

🔧 One-Time Setup: Compile Profiles

# Builds .dat files → ~/.cache/xerv/crayon/profiles/
python compile_profiles.py

Each profile takes 38ms-26s depending on size. See DAT_BUILDING_EXPLAINED.md for details.
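
To confirm which profiles have been compiled, you can list the `.dat` files in the cache directory named in the comment above. This is a standard-library sketch; the helper `list_compiled_profiles` is hypothetical, and it assumes each compiled profile is stored as a single `*.dat` file directly inside that directory.

```python
from pathlib import Path

# Cache location taken from the comment in the setup command.
DEFAULT_CACHE = Path.home() / ".cache" / "xerv" / "crayon" / "profiles"


def list_compiled_profiles(cache_dir: Path = DEFAULT_CACHE) -> dict[str, int]:
    """Map each *.dat file stem in cache_dir to its size in bytes."""
    if not cache_dir.is_dir():
        return {}
    return {p.stem: p.stat().st_size for p in sorted(cache_dir.glob("*.dat"))}


for name, size in list_compiled_profiles().items():
    print(f"{name}: {size / 1024:.1f} KiB")
```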

🧪 Verify Installation

python demo_tokenize.py

Expected output:

[1] Loading 'lite' profile...
    Status: 🚀 Fast C++ DAT Engine
[2] Tokenizing: 'Hello, world! This is Crayon.'
    Tokens IDs: [...]

🏎️ DAT Engine V2 Architecture

Crayon V2 compiles a vocabulary offline into a Double-Array Trie (DAT) binary, which the AVX2 C++ engine then loads and queries at runtime:

┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATBuilder   │ ──▶  │  vocab.dat  │ ──▶  │  C++ Engine  │
│   (List)    │      │  (Python)    │      │  (Binary)   │      │   (AVX2)     │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘

Component	File	Purpose
Offline Compiler	`dat_builder.py`	First-Fit algorithm → compact DAT binary
AVX2 Runtime	`engine.cpp`	Branchless state transitions + SIMD parallel ASCII
Zero-Copy Loader	`mmap` + buffer protocol	Instant startup, minimal RAM
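
For readers new to the data structure, here is a minimal pure-Python sketch of a double-array trie built with first-fit base allocation and queried with greedy longest-match. It is not the code in `dat_builder.py` or `engine.cpp`: the class name, the separate `value` array, the byte-level keys, and the longest-match policy are assumptions for illustration, and nothing here is SIMD-accelerated.

```python
from collections import deque


class DoubleArrayTrie:
    """Toy double-array trie over UTF-8 bytes (illustrative only)."""

    def __init__(self, vocab: list[str]):
        # Build an ordinary nested-dict trie first; key -1 marks a token end.
        root: dict = {}
        for token_id, token in enumerate(vocab):
            node = root
            for byte in token.encode("utf-8"):
                node = node.setdefault(byte, {})
            node[-1] = token_id

        self.base = [0]    # base[s] + byte gives the candidate next state
        self.check = [-2]  # check[t] == s confirms t is a child of s; -1 = free
        self.value = [-1]  # token id ending at this state, or -1

        queue = deque([(0, root)])
        while queue:
            state, node = queue.popleft()
            self.value[state] = node.get(-1, -1)
            children = sorted(k for k in node if k != -1)
            if not children:
                continue
            b = self._first_fit(children)
            self.base[state] = b
            for byte in children:
                self.check[b + byte] = state
                queue.append((b + byte, node[byte]))

    def _first_fit(self, children: list[int]) -> int:
        """Smallest base >= 1 whose slots are free for every child byte."""
        b = 1
        while True:
            self._grow(b + children[-1] + 1)
            if all(self.check[b + byte] == -1 for byte in children):
                return b
            b += 1

    def _grow(self, size: int) -> None:
        extra = size - len(self.base)
        if extra > 0:
            self.base.extend([0] * extra)
            self.check.extend([-1] * extra)
            self.value.extend([-1] * extra)

    def step(self, state: int, byte: int) -> int:
        """Follow one transition; return the next state or -1 if none exists."""
        t = self.base[state] + byte
        if t < len(self.check) and self.check[t] == state:
            return t
        return -1

    def tokenize(self, text: str, unk_id: int = -1) -> list[int]:
        """Greedy longest-match; unmatched bytes emit unk_id one at a time."""
        data = text.encode("utf-8")
        out, pos = [], 0
        while pos < len(data):
            state, best_id, best_end = 0, unk_id, pos + 1
            i = pos
            while i < len(data):
                state = self.step(state, data[i])
                if state < 0:
                    break
                i += 1
                if self.value[state] != -1:
                    best_id, best_end = self.value[state], i
            out.append(best_id)
            pos = best_end
        return out


vocab = ["fn", "main", "(", ")", " ", "{", "}", "ma"]
dat = DoubleArrayTrie(vocab)
print(dat.tokenize("fn main()"))  # [0, 4, 1, 2, 3]
```

Each lookup step is one addition and one comparison against flat arrays, which is what makes the layout friendly to memory mapping and to branch-light native code.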

🧩 Available Cartridges

Five production-ready profiles are defined in `src/crayon/core/profiles.py`:

Profile	Size	Optimized For	Sources
`lite`	50k	Speed & Mobile	WikiText, RainDrop
`science`	250k	Reasoning (LaTeX, Quantum, Grad Math)	GRAD, Physics-700
`code`	250k	Syntax (Python, Rust, C++, JS)	CodeParrot, The Stack
`multilingual`	250k	Global (EU langs, Chinese, Hindi)	OSCAR, Wikipedia
`arts_commerce`	250k	Business (Legal, Finance, Lit)	PG19, Fin Phrasebank

from crayon.core.vocabulary import CrayonVocab

vocab = CrayonVocab.load_profile("science")
vocab = CrayonVocab.load_profile("multilingual")

🛠️ Advanced Usage

Compile Vocabulary to DAT Format

from crayon.c_ext.dat_builder import DATBuilder
import json

with open("trained_vocab_lite.json", "r") as f:
    vocab = json.load(f)

builder = DATBuilder()
builder.build(vocab)
builder.save("vocab_lite.dat")

Direct C++ Engine Access

import mmap
from crayon.c_ext import crayon_fast

with open("vocab_lite.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_fast.load_dat(mm)

tokens = crayon_fast.tokenize("Your text here")
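
The zero-copy idea behind `mmap` plus the buffer protocol can be seen with the standard library alone. The sketch below writes a small binary file, maps it read-only, and reinterprets part of the mapping as 32-bit integers without copying the file into a new bytes object. The file layout (a count followed by int32 values) is hypothetical and is not Crayon's `.dat` format.

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file: a uint32 count followed by that many int32 values.
values = [7, 11, 13, 17]
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("=I", len(values)))
    f.write(struct.pack(f"={len(values)}i", *values))

# Map the file read-only and read it through buffer-protocol views.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    with memoryview(mm) as view:
        (count,) = struct.unpack_from("=I", view, 0)
        # Reinterpret the payload bytes as native int32 values in place.
        with view[4:4 + 4 * count].cast("i") as ints:
            decoded = list(ints)

print(count, decoded)
os.remove(path)
```

The views are released before the mapping closes because `mmap.close()` raises `BufferError` while exported buffers are still alive.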

Force Rebuild / Offline Mode

from crayon.core.vocabulary import CrayonVocab

# Rebuild from local resources only (fastest)
vocab = CrayonVocab.load_profile("arts_commerce", force_rebuild=True)

🏗️ Architecture

Layer	File	Purpose
Builder	`c_ext/dat_builder.py`	Offline DAT compiler
Engine	`c_ext/engine.cpp`	AVX2 SIMD runtime
Config	`core/profiles.py`	Cartridge definitions
Resources	`resources.py`	Streaming, fallbacks, caching

For a deep dive, read the Engineering Treatise.

🧪 Testing

# All tests
python -m pytest tests/ -v

# DAT engine tests
python -m pytest tests/test_c_ext.py -v

14/14 tests pass: DATBuilder, C++ module, full pipeline, Python fallback.

🔬 DAT Engine Verification

python verify_dat_engine.py

============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 9,786,707 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY

📊 Training Data

Dataset	Size	Samples	Domain
Tiny Shakespeare	1.06 MB	1 (Full)	Classical Literature
RainDrop-DTS	179 KB	3,210	Instruction Following
Physics	332 KB	700	Scientific Reasoning
GRAD Math	5.00 MB	500*	Graduate Mathematics
TOTAL	~6.56 MB	4,411	Curated Corpus

*GRAD dataset limited to 500 high-density samples for an efficient default build.

🧩 API Reference

CrayonVocab

# Constructors
CrayonVocab(tokens: List[str], unk_token: str = "<UNK>")
CrayonVocab.from_corpus(corpus: str, target_size: int = 500000)
CrayonVocab.from_default_sources(vocab_size: int = 500000)
CrayonVocab.from_file(path: str)
CrayonVocab.from_json(path: str)
CrayonVocab.load_profile(name: str)  # Load cached DAT profiles

# Methods
vocab.tokenize(text: str) -> List[int]
vocab.decode(token_ids: List[int]) -> str
vocab.save(path: str, format: str = "txt")

DAT Builder

from crayon.c_ext.dat_builder import DATBuilder

builder = DATBuilder()
builder.build(vocab_list: List[str])
builder.save(output_path: str)

C++ Engine

from crayon.c_ext import crayon_fast

crayon_fast.load_dat(buffer)  # bytes, mmap, or memoryview
crayon_fast.tokenize(text: str) -> List[int]

Utilities

from crayon import check_c_extension, check_resources

print(check_c_extension())  # True/False
print(check_resources())     # Available data sources

🤝 Contributing

We welcome contributions! Whether it's new cartridges, performance optimizations, or bug fixes—open an issue or submit a PR.

📜 Citation

@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}

📄 License

Built with 💙 by Xerv Research Engineering Division

⭐ Star this repo if Crayon helps your project!

Release history

Version	Released
5.3.6	Mar 28, 2026
5.3.4	Mar 28, 2026
5.3.3	Mar 20, 2026
5.3.2	Mar 20, 2026
5.3.1	Mar 20, 2026
5.3.0	Mar 20, 2026
5.2.9	Mar 20, 2026
5.2.8	Mar 20, 2026
5.2.7	Mar 20, 2026
5.2.6	Mar 20, 2026
5.2.5	Mar 17, 2026
5.2.4	Mar 17, 2026
5.2.3	Mar 17, 2026
5.2.2	Mar 17, 2026
5.2.1	Mar 17, 2026
5.2.0	Mar 17, 2026
5.1.3	Mar 17, 2026
5.1.2	Mar 17, 2026
5.1.0	Mar 2, 2026
5.0.1	Feb 25, 2026
4.3.0	Feb 1, 2026
4.1.9	Jan 31, 2026
4.1.8	Jan 26, 2026
4.1.7	Jan 26, 2026
4.1.6	Jan 26, 2026
4.1.5	Jan 26, 2026
4.1.4	Jan 26, 2026
4.1.3	Jan 26, 2026
4.1.2	Jan 26, 2026
4.1.1	Jan 26, 2026
4.1.0	Jan 26, 2026
4.0.9	Jan 26, 2026
4.0.8	Jan 26, 2026
4.0.7	Jan 26, 2026
4.0.6	Jan 26, 2026
4.0.5	Jan 26, 2026
4.0.4	Jan 26, 2026
4.0.3	Jan 26, 2026
4.0.2	Jan 26, 2026
4.0.1	Jan 26, 2026
2.0.3 (this version)	Jan 23, 2026
2.0.2	Jan 23, 2026
2.0.1	Jan 23, 2026
2.0.0	Jan 23, 2026

Download files

Source Distribution

xerv_crayon-2.0.3.tar.gz (7.5 MB), uploaded Jan 23, 2026, Source

Built Distribution

xerv_crayon-2.0.3-cp313-cp313-win_amd64.whl (6.0 MB), uploaded Jan 23, 2026, CPython 3.13, Windows x86-64

File details

Details for the file xerv_crayon-2.0.3.tar.gz.

File metadata

Download URL: xerv_crayon-2.0.3.tar.gz
Upload date: Jan 23, 2026
Size: 7.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for xerv_crayon-2.0.3.tar.gz
Algorithm	Hash digest
SHA256	`b266cc659734167cddac76ff782e791a21ab6ca220d7bbb2cdbe0c52c6d3725d`
MD5	`19b7cf334d0f36ea32e82ab7713d6e0b`
BLAKE2b-256	`b714386810006ee2ec33cc3448fa7a190b5eaf13ca75aa73562eaa12b524c19e`

File details

Details for the file xerv_crayon-2.0.3-cp313-cp313-win_amd64.whl.

File metadata

Download URL: xerv_crayon-2.0.3-cp313-cp313-win_amd64.whl
Upload date: Jan 23, 2026
Size: 6.0 MB
Tags: CPython 3.13, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for xerv_crayon-2.0.3-cp313-cp313-win_amd64.whl
Algorithm	Hash digest
SHA256	`c0cd245f795ce5ca44bfb8ddf11e0f662b48472ae0e52a936bc9df73f5c00f2b`
MD5	`e1b84cf4a503cc79f4fa79f73d35061f`
BLAKE2b-256	`e5e0def2e5137d9e48e661b430d0b073fa07bf2c433101925be51571678fb3aa`
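
A downloaded file can be checked against the SHA256 digests listed above using only the standard library. The helper names `sha256_of` and `verify_download` are illustrative, and the sketch assumes the source distribution was saved to the current directory under its published file name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare a file's digest with the published one, ignoring case."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Digest copied from the hash table for the source distribution.
EXPECTED_SDIST = "b266cc659734167cddac76ff782e791a21ab6ca220d7bbb2cdbe0c52c6d3725d"
sdist = Path("xerv_crayon-2.0.3.tar.gz")  # assumed download location
if sdist.exists():
    print("OK" if verify_download(sdist, EXPECTED_SDIST) else "MISMATCH")
```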
