
Lazarus

MLX-based LLM training and tokenizer toolkit for Apple Silicon.

Quick Start with uvx

No installation is needed; run commands directly with uvx:

# Encode text to see how a tokenizer splits it
uvx chuk-lazarus tokenizer encode -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Hello, world!"

# Run a health check on any tokenizer
uvx chuk-lazarus tokenizer doctor -t "gpt2"

# Compare how two tokenizers handle the same text
uvx chuk-lazarus tokenizer compare -t1 "gpt2" -t2 "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Machine learning is amazing"

Installation

# Install with uv (recommended)
uv add chuk-lazarus

# Or with pip
pip install chuk-lazarus

# For OpenAI tokenizers (gpt-4, gpt-3.5-turbo, o1, etc.)
uv add "chuk-lazarus[openai]"

# For faster tokenization (optional MLX backend)
uv add "chuk-lazarus[fast]"

After installation, use the chuk-lazarus command directly:

chuk-lazarus tokenizer encode -t "gpt2" --text "Hello"

Tokenizer CLI

The tokenizer CLI is a comprehensive toolkit for inspecting, analyzing, and debugging tokenizers. It supports both HuggingFace tokenizers and OpenAI/tiktoken models.

Basic Commands

# Encode text - see token IDs and boundaries
uvx chuk-lazarus tokenizer encode -t "gpt2" --text "The quick brown fox"

# Decode token IDs back to text
uvx chuk-lazarus tokenizer decode -t "gpt2" --ids "464,2068,7586,21831"

# Search the vocabulary
uvx chuk-lazarus tokenizer vocab -t "gpt2" --search "hello"

# Show vocabulary statistics
uvx chuk-lazarus tokenizer vocab -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

OpenAI Tokenizers

Analyze OpenAI model tokenizers using tiktoken (requires the chuk-lazarus[openai] extra):

# Encode with GPT-4's tokenizer
uvx "chuk-lazarus[openai]" tokenizer encode -t "gpt-4" --text "Hello, world!"

# Compare GPT-4 vs GPT-4o tokenization
uvx "chuk-lazarus[openai]" tokenizer compare -t1 "gpt-4" -t2 "gpt-4o" --text "Machine learning is amazing"

# Health check on GPT-3.5-turbo tokenizer
uvx "chuk-lazarus[openai]" tokenizer doctor -t "gpt-3.5-turbo"

# Use encoding names directly
uvx "chuk-lazarus[openai]" tokenizer encode -t "cl100k_base" --text "Hello"   # GPT-4 encoding
uvx "chuk-lazarus[openai]" tokenizer encode -t "o200k_base" --text "Hello"    # GPT-4o encoding

Supported OpenAI models: gpt-4, gpt-4-turbo, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, o1, o1-mini, o3-mini, and more.

Health Check & Fingerprinting

# Run comprehensive tokenizer health check
uvx chuk-lazarus tokenizer doctor -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Generate a fingerprint (for compatibility verification)
uvx chuk-lazarus tokenizer fingerprint -t "gpt2"

# Save fingerprint for CI/CD verification
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --save gpt2-fingerprint.json

# Verify tokenizer matches expected fingerprint
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --verify gpt2-fingerprint.json
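Conceptually, a fingerprint is a stable hash over the tokenizer's vocabulary and special tokens, so two tokenizers can be compared without shipping the full vocab. A minimal sketch of the idea (the actual `compute_fingerprint` implementation may hash more fields and differ in format):

```python
import hashlib
import json

def fingerprint_vocab(vocab: dict, special_tokens: list) -> str:
    """Hash the sorted vocab and special tokens into a stable hex digest."""
    payload = json.dumps(
        {"vocab": sorted(vocab.items()), "special": sorted(special_tokens)},
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two tokenizers with the same vocabulary yield the same fingerprint,
# regardless of the order entries happen to be stored in.
vocab = {"hello": 0, "world": 1, "<unk>": 2}
fp1 = fingerprint_vocab(vocab, ["<unk>"])
fp2 = fingerprint_vocab({"<unk>": 2, "world": 1, "hello": 0}, ["<unk>"])
assert fp1 == fp2
```

Because the digest is deterministic, it can be committed alongside a model and checked in CI, which is what the `--save`/`--verify` flags above automate.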

Corpus Analysis

Analyze how well a tokenizer fits your dataset:

# Coverage analysis - UNK rate, tokens per word, vocab utilization
uvx chuk-lazarus tokenizer analyze coverage -t "gpt2" --file corpus.txt

# Entropy analysis - token distribution uniformity
uvx chuk-lazarus tokenizer analyze entropy -t "gpt2" --file corpus.txt

# Fit score - overall tokenizer-dataset compatibility (0-100)
uvx chuk-lazarus tokenizer analyze fit-score -t "gpt2" --file corpus.txt

# Efficiency analysis - tokens per sample, fragmentation
uvx chuk-lazarus tokenizer analyze efficiency -t "gpt2" --file corpus.txt

# Vocabulary suggestions - find tokens to add for better compression
uvx chuk-lazarus tokenizer analyze vocab-suggest -t "gpt2" --file corpus.txt

# Compare two tokenizers on your corpus
uvx chuk-lazarus tokenizer analyze diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" -f corpus.txt
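To make the metrics above concrete, here is an illustrative sketch of how UNK rate, tokens-per-word, and token entropy can be computed; the toy whitespace tokenizer and function names are assumptions for demonstration, not the library's internals:

```python
import math

def coverage_metrics(texts, tokenize, unk_token="<unk>"):
    """Return (unk_rate, tokens_per_word) over a corpus (illustrative only)."""
    tokens = [tok for t in texts for tok in tokenize(t)]
    words = sum(len(t.split()) for t in texts)
    unk_rate = sum(tok == unk_token for tok in tokens) / len(tokens)
    return unk_rate, len(tokens) / words

def token_entropy(tokens):
    """Shannon entropy (bits) of the token distribution; higher = more uniform."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy whitespace "tokenizer" that maps out-of-vocabulary words to <unk>
known = {"the", "quick", "brown", "fox"}
tok = lambda s: [w if w in known else "<unk>" for w in s.split()]
unk, tpw = coverage_metrics(["the quick brown fox", "the zorp"], tok)
# 1 of 6 tokens is <unk>; each word maps to exactly one token here.
```

A real subword tokenizer typically yields tokens-per-word well above 1.0, and the fit score blends signals like these into a single 0-100 number.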

Instrumentation

Observability tools for understanding tokenization behavior:

# Token length histogram with ASCII visualization
uvx chuk-lazarus tokenizer instrument histogram -t "gpt2" --file corpus.txt

# OOV and rare token analysis
uvx chuk-lazarus tokenizer instrument oov -t "gpt2" --file corpus.txt --show-rare

# Padding and truncation waste analysis
uvx chuk-lazarus tokenizer instrument waste -t "gpt2" --file corpus.txt --max-length 512

# Compare vocabulary impact (before/after tokenizer swap)
uvx chuk-lazarus tokenizer instrument vocab-diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" --file corpus.txt
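The histogram command renders token-length distributions as ASCII bars. A self-contained sketch of that style of visualization (bucket size and formatting are arbitrary choices, not the CLI's exact output):

```python
def ascii_histogram(lengths, bucket=8, width=40):
    """Render sequence-length counts as ASCII bars, one row per bucket."""
    buckets = {}
    for n in lengths:
        b = (n // bucket) * bucket
        buckets[b] = buckets.get(b, 0) + 1
    peak = max(buckets.values())
    rows = []
    for b in sorted(buckets):
        bar = "#" * max(1, round(buckets[b] / peak * width))
        rows.append(f"{b:4d}-{b + bucket - 1:<4d} {bar} {buckets[b]}")
    return "\n".join(rows)

hist = ascii_histogram([3, 5, 9, 10, 11, 20])
print(hist)
```

Seeing where the mass of your corpus falls relative to your training `--max-length` is exactly what motivates the waste analysis command above.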

Training Utilities

Tools for efficient training data preparation:

# Profile tokenization throughput
uvx chuk-lazarus tokenizer training throughput -t "gpt2" --file corpus.txt

# Pack sequences for efficient training (20-40% speedup)
uvx chuk-lazarus tokenizer training pack -t "gpt2" --file corpus.txt --max-length 512 -o packed.jsonl

# Create curriculum learning buckets by token length
uvx chuk-lazarus tokenizer curriculum length-buckets -t "gpt2" --file corpus.txt

# Score texts by reasoning density for curriculum ordering
uvx chuk-lazarus tokenizer curriculum reasoning-density -t "gpt2" --file corpus.txt
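Sequence packing gains its speedup by concatenating short samples into full-length rows so fewer positions are spent on padding. A greedy first-fit sketch of the idea (the actual `training pack` command may use a different strategy and insert separator tokens):

```python
def pack_sequences(seqs, max_length):
    """Greedy first-fit: place each sequence into the first pack with room."""
    packs = []
    for seq in sorted(seqs, key=len, reverse=True):
        for pack in packs:
            if sum(len(s) for s in pack) + len(seq) <= max_length:
                pack.append(seq)
                break
        else:  # no existing pack had room; start a new one
            packs.append([seq])
    return packs

# Six short sequences fit into two 512-token rows instead of six padded rows.
seqs = [[1] * n for n in (300, 200, 180, 120, 90, 60)]
packs = pack_sequences(seqs, max_length=512)
```

Here 950 real tokens occupy 2 x 512 = 1024 positions (~93% utilization), versus 6 x 512 = 3072 positions when each sequence is padded separately.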

Regression Testing

Ensure tokenization doesn't change unexpectedly:

# Run regression tests from YAML file
uvx chuk-lazarus tokenizer regression run -t "gpt2" --tests tokenizer_tests.yaml

Example tokenizer_tests.yaml:

name: My Tokenizer Tests
tests:
  - name: basic_text
    text: "Hello, world!"
    assertion: exact_tokens
    expected: 4
  - name: roundtrip
    text: "The quick brown fox"
    assertion: roundtrip_lossless
  - name: math_symbols
    text: "x^2 + y^2 = z^2"
    assertion: max_tokens
    expected: 10
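The assertion names in the YAML above suggest semantics like these; the sketch below is an illustration of how such checks could be evaluated, with the exact behavior of `regression run` as an assumption:

```python
def check_case(n_tokens, roundtrip_ok, assertion, expected=None):
    """Evaluate one regression assertion against tokenization results."""
    if assertion == "exact_tokens":
        return n_tokens == expected          # token count must match exactly
    if assertion == "max_tokens":
        return n_tokens <= expected          # token count must not exceed bound
    if assertion == "roundtrip_lossless":
        return roundtrip_ok                  # decode(encode(text)) == text
    raise ValueError(f"unknown assertion: {assertion}")

assert check_case(4, True, "exact_tokens", expected=4)
assert check_case(8, True, "max_tokens", expected=10)
assert check_case(5, True, "roundtrip_lossless")
```

Pinning assertions like these in version control catches silent tokenizer drift, e.g. after a vocabulary or normalizer update.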

Python API

from chuk_lazarus.utils.tokenizer_loader import load_tokenizer
from chuk_lazarus.data.tokenizers.analyze import (
    analyze_coverage,
    analyze_entropy,
    calculate_fit_score,
)
from chuk_lazarus.data.tokenizers.fingerprint import compute_fingerprint

# Load any HuggingFace tokenizer
tokenizer = load_tokenizer("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Analyze coverage on your corpus
texts = ["Your training data...", "More examples..."]
coverage = analyze_coverage(texts, tokenizer)
print(f"UNK rate: {coverage.unk_rate:.2%}")
print(f"Tokens per word: {coverage.tokens_per_word:.2f}")

# Calculate fit score
fit = calculate_fit_score(texts, tokenizer)
print(f"Fit score: {fit.score}/100 ({fit.grade})")

# Generate fingerprint for compatibility checks
fp = compute_fingerprint(tokenizer)
print(f"Fingerprint: {fp.fingerprint}")

See the Tokenizers README for comprehensive documentation of all analysis, preprocessing, and training utilities.

Training CLI

# Train with SFT
uvx chuk-lazarus train sft --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --data train.jsonl --use-lora

# Train with DPO
uvx chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl

# Generate synthetic training data
uvx chuk-lazarus generate --type math --output ./data/lazarus

# Run inference
uvx chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"
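The trainers read data from JSONL files, i.e. one JSON object per line. The exact schema isn't documented here; as a purely hypothetical illustration (field names are assumptions, consult the project docs for the real ones):

```python
import json

# Hypothetical prompt/completion schema -- the actual fields expected by
# `train sft` may differ.
record = {"prompt": "What is 2+2?", "completion": "4"}
line = json.dumps(record)

# Each line of train.jsonl is one such self-contained JSON object,
# so a file can be streamed line by line without loading it whole.
assert json.loads(line) == record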

Features

  • Tokenizer Toolkit: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
  • Training: SFT, DPO, GRPO, PPO trainers with LoRA support
  • Models: LLaMA, Mistral, Gemma, Granite, StarCoder2, TinyLlama
  • Analysis: Coverage, entropy, efficiency, fit scoring, vocabulary induction
  • Instrumentation: Histograms, OOV analysis, waste metrics, vocab comparison
  • CLI: Comprehensive command-line interface for all operations

Project Structure

src/chuk_lazarus/
├── cli/                    # Command-line interface
├── data/
│   ├── tokenizers/         # Tokenizer toolkit
│   │   ├── analyze/        # Coverage, entropy, fit scoring
│   │   ├── backends/       # HuggingFace + fast MLX backends
│   │   ├── curriculum/     # Length buckets, reasoning density
│   │   ├── instrumentation/# Histograms, OOV, waste metrics
│   │   ├── preprocessing/  # Hooks, profiles, byte fallback
│   │   ├── regression/     # Token regression testing
│   │   ├── research/       # Soft tokens, embedding analysis
│   │   ├── runtime/        # Special token registry
│   │   └── training/       # Packing, throughput profiling
│   └── generators/         # Synthetic data generation
├── models/                 # Model architectures and loading
├── training/               # Trainers (SFT, DPO, GRPO, PPO)
├── inference/              # Text generation
└── utils/                  # Utilities

License

MIT
