MLX-based LLM training and inference with hybrid RL architecture

Lazarus

MLX-based LLM training and tokenizer toolkit for Apple Silicon.

Quick Start with uvx

No installation needed - run directly with uvx:

# Encode text to see how a tokenizer splits it
uvx chuk-lazarus tokenizer encode -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Hello, world!"

# Run a health check on any tokenizer
uvx chuk-lazarus tokenizer doctor -t "gpt2"

# Compare how two tokenizers handle the same text
uvx chuk-lazarus tokenizer compare -t1 "gpt2" -t2 "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Machine learning is amazing"

Installation

# Install with uv (recommended)
uv add chuk-lazarus

# Or with pip
pip install chuk-lazarus

# For faster tokenization (optional MLX backend)
uv add "chuk-lazarus[fast]"

After installation, use the lazarus command directly:

lazarus tokenizer encode -t gpt2 --text "Hello"

Tokenizer CLI

The tokenizer CLI is a comprehensive toolkit for inspecting, analyzing, and debugging tokenizers. All commands work with any HuggingFace tokenizer.

Basic Commands

# Encode text - see token IDs and boundaries
uvx chuk-lazarus tokenizer encode -t "gpt2" --text "The quick brown fox"

# Decode token IDs back to text
uvx chuk-lazarus tokenizer decode -t "gpt2" --ids "464,2068,7586,21831"

# Search the vocabulary
uvx chuk-lazarus tokenizer vocab -t "gpt2" --search "hello"

# Show vocabulary statistics
uvx chuk-lazarus tokenizer vocab -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Health Check & Fingerprinting

# Run comprehensive tokenizer health check
uvx chuk-lazarus tokenizer doctor -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Generate a fingerprint (for compatibility verification)
uvx chuk-lazarus tokenizer fingerprint -t "gpt2"

# Save fingerprint for CI/CD verification
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --save gpt2-fingerprint.json

# Verify tokenizer matches expected fingerprint
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --verify gpt2-fingerprint.json
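Fingerprints catch silent tokenizer drift between environments. As a rough illustration of the idea (not the library's actual algorithm), a fingerprint can be a stable hash over the sorted vocabulary and special tokens, so any changed merge or added token produces a different digest:

```python
import hashlib
import json

def toy_fingerprint(vocab: dict, special_tokens: list) -> str:
    """Hash sorted vocab entries and special tokens into a stable hex digest."""
    payload = json.dumps(
        {"vocab": sorted(vocab.items()), "special": sorted(special_tokens)},
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

vocab = {"hello": 0, "world": 1, "<unk>": 2}
fp = toy_fingerprint(vocab, ["<unk>"])
print(fp[:16])  # deterministic; changes if any vocab entry changes
```

Because the digest is deterministic, saving it once and re-verifying in CI is enough to prove two environments load byte-identical tokenizers.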

Corpus Analysis

Analyze how well a tokenizer fits your dataset:

# Coverage analysis - UNK rate, tokens per word, vocab utilization
uvx chuk-lazarus tokenizer analyze coverage -t "gpt2" --file corpus.txt

# Entropy analysis - token distribution uniformity
uvx chuk-lazarus tokenizer analyze entropy -t "gpt2" --file corpus.txt

# Fit score - overall tokenizer-dataset compatibility (0-100)
uvx chuk-lazarus tokenizer analyze fit-score -t "gpt2" --file corpus.txt

# Efficiency analysis - tokens per sample, fragmentation
uvx chuk-lazarus tokenizer analyze efficiency -t "gpt2" --file corpus.txt

# Vocabulary suggestions - find tokens to add for better compression
uvx chuk-lazarus tokenizer analyze vocab-suggest -t "gpt2" --file corpus.txt

# Compare two tokenizers on your corpus
uvx chuk-lazarus tokenizer analyze diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" -f corpus.txt
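The metrics these commands report reduce to simple statistics over a tokenized corpus. A minimal sketch (independent of the library's implementation, which may weight things differently): UNK rate is the fraction of tokens equal to the unknown ID, tokens-per-word measures fragmentation, and entropy measures how evenly token IDs are used:

```python
import math
from collections import Counter

def coverage_stats(token_lists, word_counts, unk_id=0):
    """UNK rate and tokens-per-word over an already-tokenized corpus."""
    tokens = [t for seq in token_lists for t in seq]
    unk_rate = sum(t == unk_id for t in tokens) / len(tokens)
    tokens_per_word = len(tokens) / sum(word_counts)
    return unk_rate, tokens_per_word

def normalized_entropy(token_lists):
    """Shannon entropy of token usage, normalized to [0, 1] by log2 of distinct IDs."""
    counts = Counter(t for seq in token_lists for t in seq)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0

# Toy corpus: two tokenized sentences of three words each; ID 0 is <unk>.
seqs = [[5, 7, 7, 0], [5, 9, 2]]
unk_rate, tpw = coverage_stats(seqs, word_counts=[3, 3], unk_id=0)
print(f"UNK rate: {unk_rate:.2%}, tokens/word: {tpw:.2f}")
print(f"entropy: {normalized_entropy(seqs):.3f}")
```

A fit score in the 0-100 style shown above can then be built by combining such components, e.g. penalizing high UNK rate and high tokens-per-word.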

Instrumentation

Observability tools for understanding tokenization behavior:

# Token length histogram with ASCII visualization
uvx chuk-lazarus tokenizer instrument histogram -t "gpt2" --file corpus.txt

# OOV and rare token analysis
uvx chuk-lazarus tokenizer instrument oov -t "gpt2" --file corpus.txt --show-rare

# Padding and truncation waste analysis
uvx chuk-lazarus tokenizer instrument waste -t "gpt2" --file corpus.txt --max-length 512

# Compare vocabulary impact (before/after tokenizer swap)
uvx chuk-lazarus tokenizer instrument vocab-diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" --file corpus.txt
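The waste metrics boil down to arithmetic over sequence lengths. A minimal sketch, assuming fixed-length padding to `max_length` (the library's report may include more detail):

```python
def waste_report(lengths, max_length):
    """Fraction of the padded batch spent on padding, plus tokens lost to truncation."""
    kept = [min(n, max_length) for n in lengths]
    padded_total = len(lengths) * max_length
    pad_waste = 1 - sum(kept) / padded_total
    truncated = sum(max(n - max_length, 0) for n in lengths)
    return pad_waste, truncated

lengths = [100, 480, 600, 50]  # token counts per sample
pad_waste, truncated = waste_report(lengths, max_length=512)
print(f"padding waste: {pad_waste:.1%}, tokens truncated: {truncated}")
```

High padding waste with low truncation suggests `max_length` can be lowered, or that sequence packing (below) would pay off.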

Training Utilities

Tools for efficient training data preparation:

# Profile tokenization throughput
uvx chuk-lazarus tokenizer training throughput -t "gpt2" --file corpus.txt

# Pack sequences for efficient training (20-40% speedup)
uvx chuk-lazarus tokenizer training pack -t "gpt2" --file corpus.txt --max-length 512 -o packed.jsonl

# Create curriculum learning buckets by token length
uvx chuk-lazarus tokenizer curriculum length-buckets -t "gpt2" --file corpus.txt

# Score texts by reasoning density for curriculum ordering
uvx chuk-lazarus tokenizer curriculum reasoning-density -t "gpt2" --file corpus.txt
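Packing gains its speedup by concatenating short sequences into one `max_length` window so padding does almost no work. A greedy first-fit sketch of the idea (the library's actual packing strategy may differ):

```python
def pack_sequences(lengths, max_length):
    """Greedy first-fit: put each sequence into the first pack with enough room."""
    packs, free = [], []
    for i, n in enumerate(lengths):
        n = min(n, max_length)  # truncate anything longer than the window
        for p in range(len(packs)):
            if free[p] >= n:
                packs[p].append(i)
                free[p] -= n
                break
        else:  # no existing pack fits: open a new one
            packs.append([i])
            free.append(max_length - n)
    return packs

lengths = [300, 200, 512, 100, 150]
packs = pack_sequences(lengths, max_length=512)
print(packs)  # [[0, 1], [2], [3, 4]]
```

Five sequences collapse into three windows here; in real training, attention masks must also be adjusted so packed sequences don't attend to each other.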

Regression Testing

Ensure tokenization doesn't change unexpectedly:

# Run regression tests from YAML file
uvx chuk-lazarus tokenizer regression run -t "gpt2" --tests tokenizer_tests.yaml

Example tokenizer_tests.yaml:

name: My Tokenizer Tests
tests:
  - name: basic_text
    text: "Hello, world!"
    assertion: exact_tokens
    expected: 4
  - name: roundtrip
    text: "The quick brown fox"
    assertion: roundtrip_lossless
  - name: math_symbols
    text: "x^2 + y^2 = z^2"
    assertion: max_tokens
    expected: 10
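The assertion names in the YAML map to simple checks. A hypothetical evaluator sketch (the real runner's behavior may differ), using a stub word-splitting tokenizer for illustration:

```python
def run_test(case, encode, decode):
    """Evaluate one regression case against encode/decode callables."""
    ids = encode(case["text"])
    kind = case["assertion"]
    if kind == "exact_tokens":
        return len(ids) == case["expected"]
    if kind == "max_tokens":
        return len(ids) <= case["expected"]
    if kind == "roundtrip_lossless":
        return decode(ids) == case["text"]
    raise ValueError(f"unknown assertion: {kind}")

# Stub tokenizer: one token per whitespace-separated word.
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

cases = [
    {"name": "basic", "text": "Hello, world!", "assertion": "exact_tokens", "expected": 2},
    {"name": "roundtrip", "text": "The quick brown fox", "assertion": "roundtrip_lossless"},
]
print([run_test(c, encode, decode) for c in cases])  # [True, True]
```

Pinning such cases in CI means a tokenizer or preprocessing change that shifts token counts fails loudly instead of silently degrading a trained model.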

Python API

from chuk_lazarus.utils.tokenizer_loader import load_tokenizer
from chuk_lazarus.data.tokenizers.analyze import (
    analyze_coverage,
    analyze_entropy,
    calculate_fit_score,
)
from chuk_lazarus.data.tokenizers.fingerprint import compute_fingerprint

# Load any HuggingFace tokenizer
tokenizer = load_tokenizer("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Analyze coverage on your corpus
texts = ["Your training data...", "More examples..."]
coverage = analyze_coverage(texts, tokenizer)
print(f"UNK rate: {coverage.unk_rate:.2%}")
print(f"Tokens per word: {coverage.tokens_per_word:.2f}")

# Calculate fit score
fit = calculate_fit_score(texts, tokenizer)
print(f"Fit score: {fit.score}/100 ({fit.grade})")

# Generate fingerprint for compatibility checks
fp = compute_fingerprint(tokenizer)
print(f"Fingerprint: {fp.fingerprint}")

See the Tokenizers README for comprehensive documentation of all analysis, preprocessing, and training utilities.

Training CLI

# Train with SFT
uvx chuk-lazarus train sft --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --data train.jsonl --use-lora

# Train with DPO
uvx chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl

# Generate synthetic training data
uvx chuk-lazarus generate --type math --output ./data/lazarus

# Run inference
uvx chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"

Features

  • Tokenizer Toolkit: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
  • Training: SFT, DPO, GRPO, PPO trainers with LoRA support
  • Models: LLaMA, Mistral, Gemma, Granite, StarCoder2, TinyLlama
  • Analysis: Coverage, entropy, efficiency, fit scoring, vocabulary induction
  • Instrumentation: Histograms, OOV analysis, waste metrics, vocab comparison
  • CLI: Comprehensive command-line interface for all operations

Project Structure

src/chuk_lazarus/
├── cli/                    # Command-line interface
├── data/
│   ├── tokenizers/         # Tokenizer toolkit
│   │   ├── analyze/        # Coverage, entropy, fit scoring
│   │   ├── backends/       # HuggingFace + fast MLX backends
│   │   ├── curriculum/     # Length buckets, reasoning density
│   │   ├── instrumentation/ # Histograms, OOV, waste metrics
│   │   ├── preprocessing/  # Hooks, profiles, byte fallback
│   │   ├── regression/     # Token regression testing
│   │   ├── research/       # Soft tokens, embedding analysis
│   │   ├── runtime/        # Special token registry
│   │   └── training/       # Packing, throughput profiling
│   └── generators/         # Synthetic data generation
├── models/                 # Model architectures and loading
├── training/               # Trainers (SFT, DPO, GRPO, PPO)
├── inference/              # Text generation
└── utils/                  # Utilities

License

MIT
