Entropy-Routed Dynamic Quantization for LLM Inference
Gearbx: Entropy-Routed Dynamic Quantization
An entropy-routed dynamic quantization engine for LLM inference that adjusts weight precision on a per-token basis during generation.
Website: gearbx.jpdz.app
How It Works
Not all tokens require the same computational fidelity. Producing "Hello, how can I help you?" demands almost no semantic reasoning, while predicting the next token of a partial differential equation is a much harder decision for the model.
Gearbx monitors the model's own output entropy at each generation step and routes subsequent forward passes through pre-cached layer weights at the appropriate precision tier:
| Gear | Precision | Memory per Param | When |
|---|---|---|---|
| low | 4-bit packed | 0.5 bytes (25% of fp16) | Filler tokens, greetings, boilerplate |
| mid | 8-bit packed | 1 byte (50% of fp16) | Standard reasoning, moderate difficulty |
| high | fp16 | 2 bytes (original) | Complex reasoning, math, proofs |
Autoregressive LLM inference is typically memory-bandwidth bound: each generated token requires reading every weight matrix in the model. Packed storage means fewer bytes transferred per token, so in the bandwidth-bound regime the achievable speedup scales roughly with the compression ratio of the layers being swapped.
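As a rough worked example of the table above, the snippet below estimates how many bytes of weight data are read per generated token at each gear. The 7-billion-parameter model size and the helper name `weight_bytes_per_token` are illustrative assumptions, not part of Gearbx. The estimate ignores quantization scales, activations, and the KV cache, and it treats every weight as being at the same gear, so it is an upper bound on the savings when only some layers are managed.

```python
# Bytes per parameter for each gear, taken from the table above.
BYTES_PER_PARAM = {"low": 0.5, "mid": 1.0, "high": 2.0}


def weight_bytes_per_token(num_params: int, gear: str) -> float:
    """Approximate weight bytes read for one token if every weight is at `gear`."""
    return num_params * BYTES_PER_PARAM[gear]


if __name__ == "__main__":
    num_params = 7_000_000_000  # assumed model size, for illustration only
    fp16_bytes = weight_bytes_per_token(num_params, "high")
    for gear in ("low", "mid", "high"):
        gear_bytes = weight_bytes_per_token(num_params, gear)
        print(f"{gear:>4}: {gear_bytes / 1e9:.1f} GB per token "
              f"({gear_bytes / fp16_bytes:.0%} of fp16)")
```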
Architecture
Prompt → [PromptClassifier] → initial_gear
            │
┌───────────▼──────────────────────────┐
│           GENERATION LOOP            │
│                                      │
│  input_ids → model.forward() → logits│
│                  │                   │
│           [EntropyMonitor]           │
│   Shannon entropy → gear decision    │
│      (rolling avg + hysteresis)      │
│                  │                   │
│          [PrecisionManager]          │
│     lazy quantize + module swap      │
│     (originals offloaded to CPU)     │
│                  │                   │
│          sample next_token           │
└──────────────────────────────────────┘
                   │
generated_ids → decode → output_text
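The entropy step in the loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `monitor.py`: the class name `RollingGearSelector`, the window size, and the `margin` hysteresis band are hypothetical, and the default thresholds are simply the Mistral values used elsewhere in this README.

```python
import math
from collections import deque


def shannon_entropy_bits(logits):
    """Shannon entropy (log base 2) of softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return -sum((e / total) * math.log2(e / total) for e in exps if e > 0.0)


class RollingGearSelector:
    """Hypothetical stand-in for the rolling-average + hysteresis decision."""

    def __init__(self, low_thresh=1.8, high_thresh=3.5, window=8, margin=0.2):
        self.low_thresh = low_thresh
        self.high_thresh = high_thresh
        self.margin = margin  # assumed hysteresis band, in bits
        self.history = deque(maxlen=window)
        self.gear = "mid"

    def update(self, entropy_bits):
        """Record one token's entropy and return the gear for the next step."""
        self.history.append(entropy_bits)
        avg = sum(self.history) / len(self.history)
        # Shift each boundary away from the current gear so that small
        # wobbles around a threshold do not trigger a gear change.
        low_cut = self.low_thresh + (self.margin if self.gear == "low" else -self.margin)
        high_cut = self.high_thresh + (-self.margin if self.gear == "high" else self.margin)
        if avg < low_cut:
            self.gear = "low"
        elif avg > high_cut:
            self.gear = "high"
        else:
            self.gear = "mid"
        return self.gear


if __name__ == "__main__":
    selector = RollingGearSelector(window=4)
    peaked = [10.0] + [0.0] * 31  # one dominant token: low entropy
    flat = [0.0] * 32             # uniform over 32 tokens: 5 bits
    for logits in (peaked, peaked, flat, flat, flat, flat):
        h = shannon_entropy_bits(logits)
        print(f"entropy={h:.2f} bits -> gear={selector.update(h)}")
```

The rolling average delays the reaction to a single surprising token, and the margin keeps the selector in its current gear until the average clearly crosses a boundary.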
Core Components
- EntropyMonitor (`monitor.py`): Computes Shannon entropy (bits, log base 2) from logit distributions. A rolling-window average smooths gear transitions. Supports vocab-size auto-scaling (reference: 32768 tokens), hysteresis to prevent oscillation near boundaries, and auto-calibration from the observed entropy distribution (p40/p60 thresholds). Deferred pipeline mode eliminates per-token GPU→CPU sync.
- PrecisionManager (`precision.py`): Lazy quantization with direct module-reference swaps. Discovers attention layers at init via `detect_attn_prefixes()` but defers quantization to the first `shift()` call. On shift: quantizes from originals, swaps the quantized module into the model tree, and offloads originals to CPU. Only one quantized variant exists at any time, so memory goes down, not up.
- QuantizedLinear Kernels (`kernels.py`): Real packed integer storage: int8 (per-channel symmetric), int6, int4 (2 values per byte), int2 (4 values per byte, ternary {-1,0,1}). Dual-path forward: fused Triton kernels on CUDA (in-register dequant, no intermediate tensor), dequant-cache path on MPS/CPU (first forward unpacks to fp16, subsequent forwards reuse the cached tensor).
- Native Acceleration (`native_kernels.py` + `csrc/`): Three tiers below Triton: MPS native Metal shaders via PyTorch's MetalShaderLibrary (zero-copy, threadgroup 256), legacy Metal command buffers, and CPU NEON vectorized matmul (ARM, `-O3 -march=armv8.2-a+fp16`).
- Fused Triton Kernels (`triton_kernels.py`): CUDA-only fused quantized matmul. Multiplies activations directly with packed int8/int4/int2 weights without materializing fp16. Dequantization happens in registers, so only packed data traverses global memory.
- GearbxModel (`model.py`): Orchestrates everything. Loads the model via transformers, auto-detects the attention architecture, and passes `vocab_size` to EntropyMonitor for threshold scaling. Manual autoregressive loop with KV caching and NaN guards for MPS stability.
- StatisticalPromptClassifier (`classifier.py`): Cold-start heuristic. Scores prompt complexity from math symbols, code keywords, length, and average word length to pick an initial gear before the entropy window fills.
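To make the "2 values per byte" int4 storage concrete, here is a small pure-Python sketch of symmetric quantization followed by nibble packing. It is an illustration only: the function names, the low-nibble-first byte order, and the use of a symmetric scale for int4 are assumptions, and the real kernels in `kernels.py` operate on tensors rather than Python lists.

```python
def quantize_row_int4(row):
    """Symmetric quantization of one row to integers in [-7, 7] plus a scale."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(x / scale))) for x in row]
    return q, scale


def pack_int4(values):
    """Pack signed 4-bit integers two per byte (assumed order: low nibble first)."""
    if len(values) % 2:
        values = list(values) + [0]  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:count]


if __name__ == "__main__":
    row = [7.0, -3.0, 0.0, 1.0, -7.0]
    q, scale = quantize_row_int4(row)
    packed = pack_int4(q)
    restored = [v * scale for v in unpack_int4(packed, len(q))]
    print(f"{len(row)} values stored in {len(packed)} bytes, restored: {restored}")
```

A dequantizing matmul would multiply the unpacked integers by the per-row scale on the fly, which is the step the fused kernels are described as doing in registers.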
Backends
| Backend | Module | Routing | Weight Access | Install |
|---|---|---|---|---|
| Transformers | GearbxModel | Per-token | Direct nn.Linear swap | pip install -e . |
| MLX | MLXGearbxModel | Per-token | mlx.nn.QuantizedLinear swap | pip install -e ".[mlx]" |
| Ollama | OllamaGearbx | Per-prompt | Black-box HTTP (any OpenAI-compat server) | pip install -e . |
Transformers backend: Full per-token gear shifting with real weight swaps. Supports MPS, CUDA, and CPU.
MLX backend: Native Apple Silicon via mlx-lm. Unified memory means no CPU offloading. Supports telemetry-only mode for pre-quantized models. Median-based calibration.
Ollama backend: Per-prompt routing via prompt classifier. Talks to Ollama, llama.cpp, vLLM, LM Studio, or any OpenAI-compatible local server over HTTP. Real per-token entropy telemetry via top_logprobs. Supports SSE streaming with StreamChunk per-token telemetry.
Device Support
| Device | Loading Strategy | Base Precision | Acceleration | Notes |
|---|---|---|---|---|
| MPS (Apple Silicon) | fp16 direct | float16 | MPS native Metal shaders | M1/M2/M3/M4, unified memory |
| CUDA (NVIDIA) | 4-bit NF4 via bitsandbytes | INT4 | Fused Triton kernels | Requires bitsandbytes |
| MLX (Apple Silicon) | mlx-lm native | fp16/bf16 | MLX graph compilation | Requires mlx, mlx-lm |
| CPU | fp32 direct | float32 | NEON vectorized (ARM) | Testing or lightweight models |
Quick Start
Transformers Backend
from gearbx import GearbxModel
# Device auto-detected: MPS > CUDA > CPU
gbm = GearbxModel(
    'mistralai/Mistral-7B-Instruct-v0.3',
    num_attn_layers_to_manage=16,
    high_thresh=3.5,
    low_thresh=1.8,
)
r = gbm.generate('What is the capital of Japan?', max_new_tokens=30)
print(r.text)
print(r.gear_stats) # e.g. {'low': 0.80, 'mid': 0.20}
MLX Backend (Apple Silicon)
from gearbx import MLXGearbxModel
gbm = MLXGearbxModel('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
r = gbm.generate('Explain entropy in information theory.', max_new_tokens=200)
print(r.text)
print(r.gear_stats)
print(f'{r.tokens_per_sec:.1f} tok/s')
Ollama Backend
from gearbx import OllamaGearbx
gbm = OllamaGearbx(
    gear_models={
        'low': 'mistral:7b',
        'mid': 'mistral:7b',
        'high': 'mistral:7b',
    },
    low_thresh=0.4,
    high_thresh=1.2,
)
# Streaming with per-token telemetry
for chunk in gbm.generate_stream('Prove sqrt(2) is irrational.', max_new_tokens=200):
    print(chunk.token, end='', flush=True)
GGUF via Ollama Cache
# Load GGUF from Ollama's local blob cache (no HF download)
python3 run_gguf.py llama3.2:1b
Installation
Apple Silicon (MPS)
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python3 -c "import torch; print('MPS:', torch.backends.mps.is_available())"
Apple Silicon (MLX)
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[mlx,dev]"
NVIDIA (CUDA)
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[cuda,dev]"
python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
TUI (Terminal UI)
# From source
cd tui && make build
# Via npm (downloads prebuilt binary)
npm install -g gearbx
gearbx
Requirements
- Python >= 3.10
- PyTorch >= 2.1.0
- transformers >= 4.40.0
- accelerate >= 0.28.0
- bitsandbytes >= 0.43.0 (CUDA only, optional)
- mlx >= 0.12.0, mlx-lm >= 0.12.0 (MLX backend, optional)
Hardware Requirements
| Target Model | Memory Required | Apple Silicon | NVIDIA GPU |
|---|---|---|---|
| Phi-3 Mini 3.8B | 4-8 GB | Any M-series | RTX 3060 |
| Mistral 7B | 8-14 GB | M1 Pro+ / M2+ | RTX 3070 |
| LLaMA-3 8B | 10-16 GB | M1 Max+ / M2 Pro+ | RTX 3080 |
| LLaMA-3 8B (full dual) | 16-20 GB | M2 Max+ / M3 Pro+ | RTX 4070 |
Running Tests
# All unit tests (no GPU needed)
pytest tests/ -v
# Individual test files
pytest tests/test_monitor.py -v
pytest tests/test_precision.py -v
pytest tests/test_integration.py -v
Benchmarks
All benchmarks support --unit mode (synthetic data, CPU) and full mode (real model, MPS/CUDA).
# Unit mode, no GPU needed
python3 benchmarks/bench_entropy.py --unit
python3 benchmarks/bench_latency.py --unit
python3 benchmarks/bench_quality.py --unit
# Full mode, auto-detects MPS or CUDA
python3 benchmarks/bench_entropy.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_latency.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_quality.py --model mistralai/Mistral-7B-Instruct-v0.3 --n 200
# Production benchmark
python3 benchmarks/bench_production.py
Benchmark Targets
| Benchmark | Metric | Target |
|---|---|---|
| Entropy Signal | Separation ratio (hard/trivial) | > 2.0x |
| Gear Shift Latency | Time per shift() call | < 0.5 ms |
| Monitor Throughput | Time per update() call | < 1.0 ms (GPU/MPS) |
| Hook Overhead | Added latency per forward pass | < 0.5 ms (GPU/MPS) |
| GSM8K Accuracy | vs. static 8-bit baseline | within 2pp |
Threshold Tuning
Thresholds auto-scale by log2(vocab_size) / log2(32768) when vocab_size is provided (GearbxModel does this automatically).
Auto-calibration: calibrate_from_logits() computes entropy from a prefill pass and sets thresholds at p40/p60 of the distribution. Post-calibration caps prevent inflated thresholds: high ≤ log2(vocab) * 0.18, low ≤ log2(vocab) * 0.06.
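The scaling rule and the capped calibration described above can be written out as a short sketch. The function names are hypothetical, the percentile uses simple linear interpolation, and mapping `low` to p40 and `high` to p60 is an assumption about how `calibrate_from_logits()` assigns the two percentiles.

```python
import math

REFERENCE_VOCAB = 32768  # reference vocabulary size named above


def scale_threshold(base_thresh, vocab_size):
    """Scale a threshold by log2(vocab_size) / log2(32768)."""
    return base_thresh * math.log2(vocab_size) / math.log2(REFERENCE_VOCAB)


def percentile(sorted_vals, pct):
    """Linear-interpolation percentile of an ascending list (pct in 0..100)."""
    pos = (len(sorted_vals) - 1) * pct / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def calibrate(entropies, vocab_size):
    """Return (low, high) at p40/p60 of observed entropies, with the stated caps."""
    vals = sorted(entropies)
    max_bits = math.log2(vocab_size)
    low = min(percentile(vals, 40), max_bits * 0.06)
    high = min(percentile(vals, 60), max_bits * 0.18)
    return low, high


if __name__ == "__main__":
    print(f"scale factor for a 128,256-token vocab: {scale_threshold(1.0, 128256):.3f}")
    observed = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # made-up prefill entropies, in bits
    print("calibrated (low, high):", calibrate(observed, 32768))
```

With a 32,768-token vocabulary the caps work out to about 0.9 bits for `low` and 2.7 bits for `high`, so the cap only binds when the observed percentile is above those values.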
Suggested Thresholds by Model
| Model | Vocab Size | low_thresh | high_thresh |
|---|---|---|---|
| LLaMA-3 8B / 70B | 128,256 | 2.2 | 4.5 |
| Mistral 7B v0.3 | 32,768 | 1.8 | 3.5 |
| Qwen2 7B | 151,936 | 2.5 | 5.0 |
| Phi-3 Mini | 32,064 | 1.7 | 3.4 |
| Gemma 2 9B | 256,000 | 2.8 | 5.5 |
For the Ollama backend, top-k truncated entropy has a narrower range, so use lower thresholds (e.g., low=0.4, high=1.2).
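The narrower range follows from the arithmetic: an entropy computed over only k candidates can never exceed log2(k) bits, whatever the vocabulary size. The sketch below shows one way to compute it from a list of natural-log probabilities such as those returned in `top_logprobs`. The function name is hypothetical, and renormalizing the top-k mass to sum to 1 is an assumption, not necessarily what Gearbx does.

```python
import math


def topk_entropy_bits(top_logprobs):
    """Entropy in bits of the top-k distribution, renormalized to sum to 1.

    `top_logprobs` holds natural-log probabilities of the k most likely tokens.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0.0)


if __name__ == "__main__":
    confident = [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.005)]
    uncertain = [math.log(0.2)] * 5
    print(f"confident step: {topk_entropy_bits(confident):.2f} bits")
    print(f"uncertain step: {topk_entropy_bits(uncertain):.2f} bits "
          f"(ceiling for k=5 is {math.log2(5):.2f})")
```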
Known Limitations
- Single-Stream Only: Designed for single-sequence local inference, not batched serving. `generate_batch()` runs prompts sequentially with per-sequence gear routing.
- KV Cache Precision Mismatch: Mid-sequence gear shifts create mixed-precision KV cache entries. Use `min_gear_duration=4` to reduce noise, or `use_cache=False` during strict validation.
- MPS fp16 Base: On Apple Silicon (transformers backend), the base model runs at fp16 since bitsandbytes doesn't support MPS. Memory usage is higher than CUDA 4-bit, but entropy routing and quality benefits still apply.
- Double Quantization: Loading GGUF→fp16→int4 compounds quantization error. PrecisionManager warns when detected. Consider a higher bit-width for already-quantized source models.
- Ollama Coarse Routing: The Ollama backend routes per-prompt (not per-token) since the server is a black box. Entropy telemetry is still per-token.
- Catastrophic Entropy: If entropy exceeds 90% of the theoretical max for 2+ consecutive tokens, generation automatically falls back to mid gear. This catches model failures from bad quantization.
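The catastrophic-entropy rule in the last item can be illustrated with a small counter. This is a sketch under stated assumptions rather than the actual guard: the class and method names are hypothetical, and the theoretical maximum is taken to be log2(vocab_size) bits, the entropy of a uniform distribution over the vocabulary.

```python
import math


class CatastrophicEntropyGuard:
    """Hypothetical helper that flags runs of near-maximal entropy."""

    def __init__(self, vocab_size, ratio=0.9, patience=2):
        self.limit = ratio * math.log2(vocab_size)  # 90% of the theoretical max
        self.patience = patience                    # consecutive tokens required
        self.streak = 0

    def should_fall_back(self, entropy_bits):
        """Return True once `patience` consecutive tokens exceed the limit."""
        self.streak = self.streak + 1 if entropy_bits > self.limit else 0
        return self.streak >= self.patience


if __name__ == "__main__":
    guard = CatastrophicEntropyGuard(vocab_size=32768)  # limit is 13.5 bits
    for h in (2.1, 14.0, 14.2, 3.0):
        action = "fall back to mid" if guard.should_fall_back(h) else "keep gear"
        print(f"entropy={h:.1f} bits -> {action}")
```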
License
Proprietary - Jpdz Labs. All rights reserved.
Download files
Source Distribution
File details
Details for the file gearbx-1.0.1.tar.gz.
File metadata
- Download URL: gearbx-1.0.1.tar.gz
- Upload date:
- Size: 50.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23650ead589a7729ab4ac78ee2231756ca89bd0db2b0d07aea50e7138bbf5f3b |
| MD5 | 99f93e1ffc29b431501586dd6d7ad533 |
| BLAKE2b-256 | 02ab3add73b43b0c956dc4572417d28343e7b33ed07b11803d484c0b7f0f842e |