MLX-based LLM training and inference with hybrid RL architecture
Project description
Lazarus
MLX-based LLM training and tokenizer toolkit for Apple Silicon.
Quick Start with uvx
No installation needed; run directly with uvx:
# Encode text to see how a tokenizer splits it
uvx chuk-lazarus tokenizer encode -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Hello, world!"
# Run a health check on any tokenizer
uvx chuk-lazarus tokenizer doctor -t "gpt2"
# Compare how two tokenizers handle the same text
uvx chuk-lazarus tokenizer compare -t1 "gpt2" -t2 "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Machine learning is amazing"
Installation
# Install with uv (recommended)
uv add chuk-lazarus
# Or with pip
pip install chuk-lazarus
# For faster tokenization (optional MLX backend)
uv add "chuk-lazarus[fast]"
After installation, use the lazarus command directly:
lazarus tokenizer encode -t gpt2 --text "Hello"
Tokenizer CLI
The tokenizer CLI is a comprehensive toolkit for inspecting, analyzing, and debugging tokenizers. All commands work with any HuggingFace tokenizer.
Basic Commands
# Encode text - see token IDs and boundaries
uvx chuk-lazarus tokenizer encode -t "gpt2" --text "The quick brown fox"
# Decode token IDs back to text
uvx chuk-lazarus tokenizer decode -t "gpt2" --ids "464,2068,7586,21831"
# Search the vocabulary
uvx chuk-lazarus tokenizer vocab -t "gpt2" --search "hello"
# Show vocabulary statistics
uvx chuk-lazarus tokenizer vocab -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Health Check & Fingerprinting
# Run comprehensive tokenizer health check
uvx chuk-lazarus tokenizer doctor -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Generate a fingerprint (for compatibility verification)
uvx chuk-lazarus tokenizer fingerprint -t "gpt2"
# Save fingerprint for CI/CD verification
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --save gpt2-fingerprint.json
# Verify tokenizer matches expected fingerprint
uvx chuk-lazarus tokenizer fingerprint -t "gpt2" --verify gpt2-fingerprint.json
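A fingerprint can gate CI: record it once at release time with --save, then fail the build if a later environment produces a different one. A minimal Python version of that check, using only the API documented below (the expected value here is a placeholder, not a real fingerprint):

from chuk_lazarus.utils.tokenizer_loader import load_tokenizer
from chuk_lazarus.data.tokenizers.fingerprint import compute_fingerprint

EXPECTED = "<fingerprint recorded at release time>"  # placeholder, not a real value

tokenizer = load_tokenizer("gpt2")
fp = compute_fingerprint(tokenizer)
if fp.fingerprint != EXPECTED:
    raise SystemExit("Tokenizer fingerprint drift detected")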
Corpus Analysis
Analyze how well a tokenizer fits your dataset:
# Coverage analysis - UNK rate, tokens per word, vocab utilization
uvx chuk-lazarus tokenizer analyze coverage -t "gpt2" --file corpus.txt
# Entropy analysis - token distribution uniformity
uvx chuk-lazarus tokenizer analyze entropy -t "gpt2" --file corpus.txt
# Fit score - overall tokenizer-dataset compatibility (0-100)
uvx chuk-lazarus tokenizer analyze fit-score -t "gpt2" --file corpus.txt
# Efficiency analysis - tokens per sample, fragmentation
uvx chuk-lazarus tokenizer analyze efficiency -t "gpt2" --file corpus.txt
# Vocabulary suggestions - find tokens to add for better compression
uvx chuk-lazarus tokenizer analyze vocab-suggest -t "gpt2" --file corpus.txt
# Compare two tokenizers on your corpus
uvx chuk-lazarus tokenizer analyze diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" -f corpus.txt
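The analyze commands read a plain-text corpus. If your data lives in JSONL, a small prep step can flatten it to the one-sample-per-line layout assumed above (this helper is hypothetical, not part of the toolkit):

import json

# Hypothetical prep step: flatten {"text": ...} JSONL records into the
# one-sample-per-line corpus.txt layout the analyze commands consume.
with open("data.jsonl") as src, open("corpus.txt", "w") as dst:
    for line in src:
        text = json.loads(line)["text"]
        dst.write(text.replace("\n", " ") + "\n")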
Instrumentation
Observability tools for understanding tokenization behavior:
# Token length histogram with ASCII visualization
uvx chuk-lazarus tokenizer instrument histogram -t "gpt2" --file corpus.txt
# OOV and rare token analysis
uvx chuk-lazarus tokenizer instrument oov -t "gpt2" --file corpus.txt --show-rare
# Padding and truncation waste analysis
uvx chuk-lazarus tokenizer instrument waste -t "gpt2" --file corpus.txt --max-length 512
# Compare vocabulary impact (before/after tokenizer swap)
uvx chuk-lazarus tokenizer instrument vocab-diff -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" --file corpus.txt
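As a rough picture of what the histogram command reports, here is a back-of-the-envelope version in plain Python. The bucket width and ASCII rendering are illustrative choices, not the CLI's exact behavior:

from collections import Counter
from chuk_lazarus.utils.tokenizer_loader import load_tokenizer

# Sketch of a token-length histogram over a one-text-per-line corpus.
tokenizer = load_tokenizer("gpt2")
counts = Counter()
with open("corpus.txt") as f:
    for line in f:
        n = len(tokenizer.encode(line.rstrip("\n")))
        counts[(n // 16) * 16] += 1  # bucket lengths into ranges of 16
for start in sorted(counts):
    print(f"{start:>4}-{start + 15:<4} {'#' * counts[start]}")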
Training Utilities
Tools for efficient training data preparation:
# Profile tokenization throughput
uvx chuk-lazarus tokenizer training throughput -t "gpt2" --file corpus.txt
# Pack sequences for efficient training (20-40% speedup)
uvx chuk-lazarus tokenizer training pack -t "gpt2" --file corpus.txt --max-length 512 -o packed.jsonl
# Create curriculum learning buckets by token length
uvx chuk-lazarus tokenizer curriculum length-buckets -t "gpt2" --file corpus.txt
# Score texts by reasoning density for curriculum ordering
uvx chuk-lazarus tokenizer curriculum reasoning-density -t "gpt2" --file corpus.txt
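The pack command concatenates short samples so each training sequence sits close to max-length; the quoted 20-40% speedup comes from spending less compute on padding tokens. An illustrative sketch of the general technique, not the library's implementation:

def pack_sequences(token_lists, max_length=512):
    """Greedily concatenate tokenized samples into near-full sequences."""
    packs, current = [], []
    for tokens in token_lists:
        tokens = tokens[:max_length]  # truncate overlong samples
        if current and len(current) + len(tokens) > max_length:
            packs.append(current)  # close the full pack, start a new one
            current = []
        current.extend(tokens)
    if current:
        packs.append(current)
    return packs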
Regression Testing
Ensure tokenization doesn't change unexpectedly:
# Run regression tests from YAML file
uvx chuk-lazarus tokenizer regression run -t "gpt2" --tests tokenizer_tests.yaml
Example tokenizer_tests.yaml:
name: My Tokenizer Tests
tests:
  - name: basic_text
    text: "Hello, world!"
    assertion: exact_tokens
    expected: 4
  - name: roundtrip
    text: "The quick brown fox"
    assertion: roundtrip_lossless
  - name: math_symbols
    text: "x^2 + y^2 = z^2"
    assertion: max_tokens
    expected: 10
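The assertion names map to simple checks on the encoded output. A hedged reading of their semantics (the CLI's actual implementation may differ):

def check(tokenizer, text, assertion, expected=None):
    ids = tokenizer.encode(text)
    if assertion == "exact_tokens":
        return len(ids) == expected           # token count must match exactly
    if assertion == "max_tokens":
        return len(ids) <= expected           # token count must not exceed the budget
    if assertion == "roundtrip_lossless":
        return tokenizer.decode(ids) == text  # decode(encode(text)) reproduces text
    raise ValueError(f"Unknown assertion: {assertion}")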
Python API
from chuk_lazarus.utils.tokenizer_loader import load_tokenizer
from chuk_lazarus.data.tokenizers.analyze import (
    analyze_coverage,
    analyze_entropy,
    calculate_fit_score,
)
from chuk_lazarus.data.tokenizers.fingerprint import compute_fingerprint
# Load any HuggingFace tokenizer
tokenizer = load_tokenizer("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Analyze coverage on your corpus
texts = ["Your training data...", "More examples..."]
coverage = analyze_coverage(texts, tokenizer)
print(f"UNK rate: {coverage.unk_rate:.2%}")
print(f"Tokens per word: {coverage.tokens_per_word:.2f}")
# Calculate fit score
fit = calculate_fit_score(texts, tokenizer)
print(f"Fit score: {fit.score}/100 ({fit.grade})")
# Generate fingerprint for compatibility checks
fp = compute_fingerprint(tokenizer)
print(f"Fingerprint: {fp.fingerprint}")
See the Tokenizers README for comprehensive documentation of all analysis, preprocessing, and training utilities.
Training CLI
# Train with SFT
uvx chuk-lazarus train sft --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --data train.jsonl --use-lora
# Train with DPO
uvx chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl
# Generate synthetic training data
uvx chuk-lazarus generate --type math --output ./data/lazarus
# Run inference
uvx chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"
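The exact dataset schemas are defined by the trainers; as a hypothetical illustration, SFT data is conventionally prompt/response pairs and DPO data prompt/chosen/rejected triples, one JSON object per line:

# train.jsonl (hypothetical SFT record shape)
{"prompt": "What is 2+2?", "response": "2+2 equals 4."}
# preferences.jsonl (hypothetical DPO record shape)
{"prompt": "What is 2+2?", "chosen": "2+2 equals 4.", "rejected": "5"}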
Features
- Tokenizer Toolkit: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
- Training: SFT, DPO, GRPO, PPO trainers with LoRA support
- Models: LLaMA, Mistral, Gemma, Granite, StarCoder2, TinyLlama
- Analysis: Coverage, entropy, efficiency, fit scoring, vocabulary induction
- Instrumentation: Histograms, OOV analysis, waste metrics, vocab comparison
- CLI: Comprehensive command-line interface for all operations
Project Structure
src/chuk_lazarus/
├── cli/                     # Command-line interface
├── data/
│   ├── tokenizers/          # Tokenizer toolkit
│   │   ├── analyze/         # Coverage, entropy, fit scoring
│   │   ├── backends/        # HuggingFace + fast MLX backends
│   │   ├── curriculum/      # Length buckets, reasoning density
│   │   ├── instrumentation/ # Histograms, OOV, waste metrics
│   │   ├── preprocessing/   # Hooks, profiles, byte fallback
│   │   ├── regression/      # Token regression testing
│   │   ├── research/        # Soft tokens, embedding analysis
│   │   ├── runtime/         # Special token registry
│   │   └── training/        # Packing, throughput profiling
│   └── generators/          # Synthetic data generation
├── models/                  # Model architectures and loading
├── training/                # Trainers (SFT, DPO, GRPO, PPO)
├── inference/               # Text generation
└── utils/                   # Utilities
License
MIT
Download files
Source Distribution: chuk_lazarus-0.2.1.tar.gz
Built Distribution: chuk_lazarus-0.2.1-py3-none-any.whl
File details
Details for the file chuk_lazarus-0.2.1.tar.gz.
File metadata
- Download URL: chuk_lazarus-0.2.1.tar.gz
- Upload date:
- Size: 193.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 397613a30cda1279e614c387967d798d0881be5905b37471e0a4dd230964fe5e |
| MD5 | 9634a66b869edbd88b376042dc05f5e0 |
| BLAKE2b-256 | 642deff43b345160d16c768223eaaef0278e64c00205714b25ff0ae3ea14bfdc |
File details
Details for the file chuk_lazarus-0.2.1-py3-none-any.whl.
File metadata
- Download URL: chuk_lazarus-0.2.1-py3-none-any.whl
- Upload date:
- Size: 280.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2442f7a835ca26e7e3cdf62085a08bc0a0cef683be60dbb1467894c5d3b1f63 |
| MD5 | 1342c4162985a2debd90a8a07589a0eb |
| BLAKE2b-256 | 8db8cba7e6ce3af02d1049d8c640e40fbb7fd971eb975ab0b603634b4fe419f7 |