
Lazarus

MLX-based LLM training and tokenizer toolkit for Apple Silicon.

Quick Start with uvx

No installation is needed; run commands directly with uvx:

# Encode text to see how a tokenizer splits it
uvx chuk-lazarus tokenizer encode -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Hello, world!"

# Run a health check on any tokenizer
uvx chuk-lazarus tokenizer doctor -t "gpt2"

# Compare how two tokenizers handle the same text
uvx chuk-lazarus tokenizer compare -t1 "gpt2" -t2 "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Machine learning is amazing"

Installation

# Install with uv (recommended)
uv add chuk-lazarus

# Or with pip
pip install chuk-lazarus

# For OpenAI tokenizers (gpt-4, gpt-3.5-turbo, o1, etc.)
uv add "chuk-lazarus[openai]"

# For faster tokenization (optional MLX backend)
uv add "chuk-lazarus[fast]"

After installation, use the chuk-lazarus command directly:

chuk-lazarus tokenizer encode -t "gpt2" --text "Hello"

Tokenizer CLI

The tokenizer CLI is a comprehensive toolkit for inspecting, analyzing, and debugging tokenizers. It supports both HuggingFace tokenizers and OpenAI/tiktoken models.

Basic Commands

# Encode text - see token IDs and boundaries
uvx chuk-lazarus tokenizer encode -t "gpt2" --text "The quick brown fox"

# Decode token IDs back to text
uvx chuk-lazarus tokenizer decode -t "gpt2" --ids "464,2068,7586,21831"

# Search the vocabulary
uvx chuk-lazarus tokenizer vocab -t "gpt2" --search "hello"

# Show vocabulary statistics
uvx chuk-lazarus tokenizer vocab -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

OpenAI Tokenizers

Analyze OpenAI model tokenizers using tiktoken (requires the chuk-lazarus[openai] extra):

# Encode with GPT-4's tokenizer
uvx "chuk-lazarus[openai]" tokenizer encode -t "gpt-4" --text "Hello, world!"

# Compare GPT-4 vs GPT-4o tokenization
uvx "chuk-lazarus[openai]" tokenizer compare -t1 "gpt-4" -t2 "gpt-4o" --text "Machine learning is amazing"

# Health check on GPT-3.5-turbo tokenizer
uvx "chuk-lazarus[openai]" tokenizer doctor -t "gpt-3.5-turbo"

# Use encoding names directly
uvx "chuk-lazarus[openai]" tokenizer encode -t "cl100k_base" --text "Hello"   # GPT-4 encoding
uvx "chuk-lazarus[openai]" tokenizer encode -t "o200k_base" --text "Hello"    # GPT-4o encoding

Supported OpenAI models: gpt-4, gpt-4-turbo, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, o1, o1-mini, o3-mini, and more.

Health Check & Fingerprinting

# Run comprehensive tokenizer health check
uvx chuk-lazarus tokenizer doctor -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Generate a fingerprint (for compatibility verification)
uvx chuk-lazarus tokenizer fingerprint -t "gpt2"

# Save fingerprint for CI/CD verification
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --save gpt2-fingerprint.json

# Verify tokenizer matches expected fingerprint
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --verify gpt2-fingerprint.json
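Conceptually, a fingerprint is a stable hash over the tokenizer's vocabulary and special tokens, so two tokenizers can be compared without shipping the full vocab. A minimal sketch of the idea (the actual `compute_fingerprint` implementation may hash more fields and differ in format):

```python
import hashlib
import json

def fingerprint_vocab(vocab: dict, special_tokens: list) -> str:
    """Hash the sorted vocab and special tokens into a stable hex digest."""
    payload = json.dumps(
        {"vocab": sorted(vocab.items()), "special": sorted(special_tokens)},
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two tokenizers with the same vocabulary yield the same fingerprint,
# regardless of the order entries happen to be stored in.
vocab = {"hello": 0, "world": 1, "<unk>": 2}
fp1 = fingerprint_vocab(vocab, ["<unk>"])
fp2 = fingerprint_vocab({"<unk>": 2, "world": 1, "hello": 0}, ["<unk>"])
assert fp1 == fp2
```

Because the digest is deterministic, it can be committed alongside a model and checked in CI, which is what the `--save`/`--verify` flags above automate.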

Corpus Analysis

Analyze how well a tokenizer fits your dataset:

# Coverage analysis - UNK rate, tokens per word, vocab utilization
uvx chuk-lazarus tokenizer analyze coverage -t "gpt2" --file corpus.txt

# Entropy analysis - token distribution uniformity
uvx chuk-lazarus tokenizer analyze entropy -t "gpt2" --file corpus.txt

# Fit score - overall tokenizer-dataset compatibility (0-100)
uvx chuk-lazarus tokenizer analyze fit-score -t "gpt2" --file corpus.txt

# Efficiency analysis - tokens per sample, fragmentation
uvx chuk-lazarus tokenizer analyze efficiency -t "gpt2" --file corpus.txt

# Vocabulary suggestions - find tokens to add for better compression
uvx chuk-lazarus tokenizer analyze vocab-suggest -t "gpt2" --file corpus.txt

# Compare two tokenizers on your corpus
uvx chuk-lazarus tokenizer analyze diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" -f corpus.txt
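To make the metrics above concrete, here is an illustrative sketch of how UNK rate, tokens-per-word, and token entropy can be computed; the toy whitespace tokenizer and function names are assumptions for demonstration, not the library's internals:

```python
import math

def coverage_metrics(texts, tokenize, unk_token="<unk>"):
    """Return (unk_rate, tokens_per_word) over a corpus (illustrative only)."""
    tokens = [tok for t in texts for tok in tokenize(t)]
    words = sum(len(t.split()) for t in texts)
    unk_rate = sum(tok == unk_token for tok in tokens) / len(tokens)
    return unk_rate, len(tokens) / words

def token_entropy(tokens):
    """Shannon entropy (bits) of the token distribution; higher = more uniform."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy whitespace "tokenizer" that maps out-of-vocabulary words to <unk>
known = {"the", "quick", "brown", "fox"}
tok = lambda s: [w if w in known else "<unk>" for w in s.split()]
unk, tpw = coverage_metrics(["the quick brown fox", "the zorp"], tok)
# 1 of 6 tokens is <unk>; each word maps to exactly one token here.
```

A real subword tokenizer typically yields tokens-per-word well above 1.0, and the fit score blends signals like these into a single 0-100 number.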

Instrumentation

Observability tools for understanding tokenization behavior:

# Token length histogram with ASCII visualization
uvx chuk-lazarus tokenizer instrument histogram -t "gpt2" --file corpus.txt

# OOV and rare token analysis
uvx chuk-lazarus tokenizer instrument oov -t "gpt2" --file corpus.txt --show-rare

# Padding and truncation waste analysis
uvx chuk-lazarus tokenizer instrument waste -t "gpt2" --file corpus.txt --max-length 512

# Compare vocabulary impact (before/after tokenizer swap)
uvx chuk-lazarus tokenizer instrument vocab-diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" --file corpus.txt
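The histogram command renders token-length distributions as ASCII bars. A self-contained sketch of that style of visualization (bucket size and formatting are arbitrary choices, not the CLI's exact output):

```python
def ascii_histogram(lengths, bucket=8, width=40):
    """Render sequence-length counts as ASCII bars, one row per bucket."""
    buckets = {}
    for n in lengths:
        b = (n // bucket) * bucket
        buckets[b] = buckets.get(b, 0) + 1
    peak = max(buckets.values())
    rows = []
    for b in sorted(buckets):
        bar = "#" * max(1, round(buckets[b] / peak * width))
        rows.append(f"{b:4d}-{b + bucket - 1:<4d} {bar} {buckets[b]}")
    return "\n".join(rows)

hist = ascii_histogram([3, 5, 9, 10, 11, 20])
print(hist)
```

Seeing where the mass of your corpus falls relative to your training `--max-length` is exactly what motivates the waste analysis command above.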

Training Utilities

Tools for efficient training data preparation:

# Profile tokenization throughput
uvx chuk-lazarus tokenizer training throughput -t "gpt2" --file corpus.txt

# Pack sequences for efficient training (20-40% speedup)
uvx chuk-lazarus tokenizer training pack -t "gpt2" --file corpus.txt --max-length 512 -o packed.jsonl

# Create curriculum learning buckets by token length
uvx chuk-lazarus tokenizer curriculum length-buckets -t "gpt2" --file corpus.txt

# Score texts by reasoning density for curriculum ordering
uvx chuk-lazarus tokenizer curriculum reasoning-density -t "gpt2" --file corpus.txt
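Sequence packing gains its speedup by concatenating short samples into full-length rows so fewer positions are spent on padding. A greedy first-fit sketch of the idea (the actual `training pack` command may use a different strategy and insert separator tokens):

```python
def pack_sequences(seqs, max_length):
    """Greedy first-fit: place each sequence into the first pack with room."""
    packs = []
    for seq in sorted(seqs, key=len, reverse=True):
        for pack in packs:
            if sum(len(s) for s in pack) + len(seq) <= max_length:
                pack.append(seq)
                break
        else:  # no existing pack had room; start a new one
            packs.append([seq])
    return packs

# Six short sequences fit into two 512-token rows instead of six padded rows.
seqs = [[1] * n for n in (300, 200, 180, 120, 90, 60)]
packs = pack_sequences(seqs, max_length=512)
```

Here 950 real tokens occupy 2 x 512 = 1024 positions (~93% utilization), versus 6 x 512 = 3072 positions when each sequence is padded separately.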

Regression Testing

Ensure tokenization doesn't change unexpectedly:

# Run regression tests from YAML file
uvx chuk-lazarus tokenizer regression run -t "gpt2" --tests tokenizer_tests.yaml

Example tokenizer_tests.yaml:

name: My Tokenizer Tests
tests:
  - name: basic_text
    text: "Hello, world!"
    assertion: exact_tokens
    expected: 4
  - name: roundtrip
    text: "The quick brown fox"
    assertion: roundtrip_lossless
  - name: math_symbols
    text: "x^2 + y^2 = z^2"
    assertion: max_tokens
    expected: 10
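The assertion names in the YAML above suggest semantics like these; the sketch below is an illustration of how such checks could be evaluated, with the exact behavior of `regression run` as an assumption:

```python
def check_case(n_tokens, roundtrip_ok, assertion, expected=None):
    """Evaluate one regression assertion against tokenization results."""
    if assertion == "exact_tokens":
        return n_tokens == expected          # token count must match exactly
    if assertion == "max_tokens":
        return n_tokens <= expected          # token count must not exceed bound
    if assertion == "roundtrip_lossless":
        return roundtrip_ok                  # decode(encode(text)) == text
    raise ValueError(f"unknown assertion: {assertion}")

assert check_case(4, True, "exact_tokens", expected=4)
assert check_case(8, True, "max_tokens", expected=10)
assert check_case(5, True, "roundtrip_lossless")
```

Pinning assertions like these in version control catches silent tokenizer drift, e.g. after a vocabulary or normalizer update.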

Python API

from chuk_lazarus.utils.tokenizer_loader import load_tokenizer
from chuk_lazarus.data.tokenizers.analyze import (
    analyze_coverage,
    analyze_entropy,
    calculate_fit_score,
)
from chuk_lazarus.data.tokenizers.fingerprint import compute_fingerprint

# Load any HuggingFace tokenizer
tokenizer = load_tokenizer("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Analyze coverage on your corpus
texts = ["Your training data...", "More examples..."]
coverage = analyze_coverage(texts, tokenizer)
print(f"UNK rate: {coverage.unk_rate:.2%}")
print(f"Tokens per word: {coverage.tokens_per_word:.2f}")

# Calculate fit score
fit = calculate_fit_score(texts, tokenizer)
print(f"Fit score: {fit.score}/100 ({fit.grade})")

# Generate fingerprint for compatibility checks
fp = compute_fingerprint(tokenizer)
print(f"Fingerprint: {fp.fingerprint}")

See the Tokenizers README for comprehensive documentation of all analysis, preprocessing, and training utilities.

Training CLI

# Train with SFT
uvx chuk-lazarus train sft --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --data train.jsonl --use-lora

# Train with DPO
uvx chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl

# Generate synthetic training data
uvx chuk-lazarus generate --type math --output ./data/lazarus

# Run inference
uvx chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"
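The trainers read data from JSONL files, i.e. one JSON object per line. The exact schema isn't documented here; as a purely hypothetical illustration (field names are assumptions, consult the project docs for the real ones):

```python
import json

# Hypothetical prompt/completion schema -- the actual fields expected by
# `train sft` may differ.
record = {"prompt": "What is 2+2?", "completion": "4"}
line = json.dumps(record)

# Each line of train.jsonl is one such self-contained JSON object,
# so a file can be streamed line by line without loading it whole.
assert json.loads(line) == record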

Features

  • Tokenizer Toolkit: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
  • Training: SFT, DPO, GRPO, PPO trainers with LoRA support
  • Models: LLaMA, Mistral, Gemma, Granite, StarCoder2, TinyLlama
  • Analysis: Coverage, entropy, efficiency, fit scoring, vocabulary induction
  • Instrumentation: Histograms, OOV analysis, waste metrics, vocab comparison
  • CLI: Comprehensive command-line interface for all operations

Project Structure

src/chuk_lazarus/
├── cli/                    # Command-line interface
├── data/
│   ├── tokenizers/         # Tokenizer toolkit
│   │   ├── analyze/        # Coverage, entropy, fit scoring
│   │   ├── backends/       # HuggingFace + fast MLX backends
│   │   ├── curriculum/     # Length buckets, reasoning density
│   │   ├── instrumentation/# Histograms, OOV, waste metrics
│   │   ├── preprocessing/  # Hooks, profiles, byte fallback
│   │   ├── regression/     # Token regression testing
│   │   ├── research/       # Soft tokens, embedding analysis
│   │   ├── runtime/        # Special token registry
│   │   └── training/       # Packing, throughput profiling
│   └── generators/         # Synthetic data generation
├── models/                 # Model architectures and loading
├── training/               # Trainers (SFT, DPO, GRPO, PPO)
├── inference/              # Text generation
└── utils/                  # Utilities

License

MIT
