
Entropy-Routed Dynamic Quantization for LLM Inference



Gearbx: Entropy-Routed Dynamic Quantization

An entropy-routed dynamic quantization engine for LLM inference that adjusts weight precision on a per-token basis during generation.

Website: gearbx.jpdz.app

How It Works

Not all tokens require the same computational fidelity. Producing "Hello, how can I help you?" demands almost no semantic reasoning, while the next token in a partial differential equation imposes high cognitive load on the model.

Gearbx monitors the model's own output entropy at each generation step and routes subsequent forward passes through pre-cached layer weights at the appropriate precision tier:

Gear | Precision | Memory per Param | When
low | 4-bit packed | 0.5 bytes (25% of fp16) | Filler tokens, greetings, boilerplate
mid | 8-bit packed | 1 byte (50% of fp16) | Standard reasoning, moderate difficulty
high | fp16 | 2 bytes (original) | Complex reasoning, math, proofs

Autoregressive LLM inference is memory-bandwidth bound: generating each token requires reading the model's weights from memory. Packed storage means fewer bytes transferred per token, so decode speed can improve roughly in proportion to the compression ratio wherever weight traffic dominates.
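As a back-of-the-envelope illustration of the bandwidth argument, the sketch below computes the weight bytes read per decode step at each gear. The 7B parameter count and the helper names are assumptions for illustration only. It also assumes every weight is stored at the gear's precision and ignores quantization scales, KV-cache traffic, and activations, so the speedup it prints is an upper bound rather than a measurement.

```python
# Bytes per parameter for each gear, taken from the precision table above.
BYTES_PER_PARAM = {"low": 0.5, "mid": 1.0, "high": 2.0}


def weight_bytes_per_token(num_params: int, gear: str) -> float:
    """Weight bytes read for one decode step if every weight is loaded once."""
    return num_params * BYTES_PER_PARAM[gear]


def ideal_speedup_vs_fp16(gear: str) -> float:
    """Upper bound on decode speedup when weight traffic is the only cost."""
    return BYTES_PER_PARAM["high"] / BYTES_PER_PARAM[gear]


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical 7B-parameter model
    for gear in ("low", "mid", "high"):
        gib = weight_bytes_per_token(num_params, gear) / 2**30
        speedup = ideal_speedup_vs_fp16(gear)
        print(f"{gear}: {gib:.1f} GiB per token, ideal speedup {speedup:.0f}x")
```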

Architecture

Prompt → [PromptClassifier] → initial_gear
                │
    ┌───────────▼──────────────────────────┐
    │       GENERATION LOOP                │
    │                                      │
    │  input_ids → model.forward() → logits│
    │                    │                 │
    │          [EntropyMonitor]            │
    │      Shannon entropy → gear decision │
    │           (rolling avg + hysteresis) │
    │                    │                 │
    │         [PrecisionManager]           │
    │     lazy quantize + module swap      │
    │     (originals offloaded to CPU)     │
    │                    │                 │
    │          sample next_token           │
    └──────────────────────────────────────┘
                │
    generated_ids → decode → output_text
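The gear decision in the loop above (Shannon entropy, rolling average, hysteresis) can be sketched in plain Python. This is a minimal illustration, not the code in monitor.py: the class name GearSelector, the window size, and the hysteresis margin are assumptions, and the default thresholds are simply the Mistral values from the Quick Start.

```python
import math
from collections import deque


def shannon_entropy_bits(logits):
    """Shannon entropy (base 2) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / total
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy


class GearSelector:
    """Hypothetical gear selector: rolling mean of entropy plus a hysteresis margin."""

    def __init__(self, low_thresh=1.8, high_thresh=3.5, window=8, margin=0.2,
                 initial_gear="mid"):
        self.low_thresh = low_thresh
        self.high_thresh = high_thresh
        self.margin = margin  # assumed width of the dead band around each threshold
        self.history = deque(maxlen=window)
        self.gear = initial_gear  # e.g. supplied by the prompt classifier

    def update(self, logits):
        """Record one step's logits and return the gear for the next forward pass."""
        self.history.append(shannon_entropy_bits(logits))
        avg = sum(self.history) / len(self.history)
        m = self.margin
        # A gear only changes once the average clears a threshold by the margin,
        # so values hovering near a boundary do not cause oscillation.
        if self.gear == "low":
            if avg > self.high_thresh + m:
                self.gear = "high"
            elif avg > self.low_thresh + m:
                self.gear = "mid"
        elif self.gear == "mid":
            if avg > self.high_thresh + m:
                self.gear = "high"
            elif avg < self.low_thresh - m:
                self.gear = "low"
        else:  # currently "high"
            if avg < self.low_thresh - m:
                self.gear = "low"
            elif avg < self.high_thresh - m:
                self.gear = "mid"
        return self.gear
```

A uniform distribution over 16 tokens has exactly 4 bits of entropy, so feeding it to a fresh selector moves it to the high gear, while a run of sharply peaked distributions walks it back down through mid to low.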

Core Components

  • EntropyMonitor (monitor.py): Computes Shannon entropy (in bits, log base 2) from logit distributions. A rolling-window average smooths gear transitions. Supports vocab-size auto-scaling (reference: 32768 tokens), hysteresis to prevent oscillation near boundaries, and auto-calibration from the observed entropy distribution (p40/p60 thresholds). Deferred pipeline mode eliminates the per-token GPU→CPU sync.

  • PrecisionManager (precision.py): Lazy quantization with direct module-reference swaps. Discovers attention layers at init via detect_attn_prefixes() but defers quantization to the first shift() call. On shift, it quantizes from the originals, swaps the quantized module into the model tree, and offloads the originals to CPU. Only one quantized variant exists at any time, so memory use goes down, not up.

  • QuantizedLinear Kernels (kernels.py): Real packed integer storage: int8 (per-channel symmetric), int6, int4 (2 values per byte), and int2 (4 values per byte, ternary {-1,0,1}). Dual-path forward: fused Triton kernels on CUDA (in-register dequant, no intermediate tensor) and a dequant-cache path on MPS/CPU (the first forward unpacks to fp16; subsequent forwards reuse the cached tensor).

  • Native Acceleration (native_kernels.py + csrc/): Three tiers below Triton: MPS native Metal shaders via PyTorch's MetalShaderLibrary (zero-copy, threadgroup 256), legacy Metal command buffers, and CPU NEON vectorized matmul (ARM -O3 -march=armv8.2-a+fp16).

  • Fused Triton Kernels (triton_kernels.py): CUDA-only fused quantized matmul. Multiplies activations directly with packed int8/int4/int2 weights without materializing fp16. Dequantization happens in registers, so only packed data traverses global memory.

  • GearbxModel (model.py): Orchestrates everything. Loads the model via transformers, auto-detects the attention architecture, and passes vocab_size to EntropyMonitor for threshold scaling. Runs a manual autoregressive loop with KV caching and NaN guards for MPS stability.

  • StatisticalPromptClassifier (classifier.py): Cold-start heuristic. Scores prompt complexity from math symbols, code keywords, prompt length, and average word length to choose the initial gear before the entropy window fills.
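To make the "int4 (2 values per byte)" storage in the QuantizedLinear bullet concrete, here is a minimal pure-Python sketch of symmetric quantization of one output channel followed by nibble packing. It is an illustration only, not the code in kernels.py: the function names, the [-7, 7] integer range, and the low-nibble-first byte layout are all assumptions.

```python
def quantize_row_int4(row):
    """Symmetric quantization of one output channel to integers in [-7, 7]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(x / scale))) for x in row]
    return q, scale


def pack_int4(values):
    """Pack signed 4-bit integers two per byte (assumed layout: low nibble first)."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1 (two's complement).
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


def dequantize(q, scale):
    """Map quantized integers back to approximate floating-point weights."""
    return [v * scale for v in q]
```

Four weights occupy two bytes after packing, which is the 0.5 bytes per parameter quoted for the low gear (the per-channel scale adds a small overhead not counted there).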

Backends

Backend | Module | Routing | Weight Access | Install
Transformers | GearbxModel | Per-token | Direct nn.Linear swap | pip install -e .
MLX | MLXGearbxModel | Per-token | mlx.nn.QuantizedLinear swap | pip install -e ".[mlx]"
Ollama | OllamaGearbx | Per-prompt | Black-box HTTP (any OpenAI-compat server) | pip install -e .

Transformers backend: Full per-token gear shifting with real weight swaps. Supports MPS, CUDA, and CPU.

MLX backend: Native Apple Silicon via mlx-lm. Unified memory means no CPU offloading. Supports telemetry-only mode for pre-quantized models. Median-based calibration.

Ollama backend: Per-prompt routing via prompt classifier. Talks to Ollama, llama.cpp, vLLM, LM Studio, or any OpenAI-compatible local server over HTTP. Real per-token entropy telemetry via top_logprobs. Supports SSE streaming with StreamChunk per-token telemetry.
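When only the top-k log-probabilities are available from the server, entropy has to be estimated from that truncated distribution. The sketch below shows one plausible way to do it; renormalizing the k probabilities so they sum to 1 is an assumption, as is the function name, and the actual Ollama backend may treat the missing tail differently.

```python
import math


def topk_entropy_bits(top_logprobs):
    """Entropy in bits over top-k natural-log probabilities, renormalized to sum to 1."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log2(q)
    return entropy
```

Under this assumption the estimate can never exceed log2(k) bits (about 2.32 bits for k = 5), far below the full-vocabulary maximum, which is consistent with this backend needing lower thresholds.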

Device Support

Device | Loading Strategy | Base Precision | Acceleration | Notes
MPS (Apple Silicon) | fp16 direct | float16 | MPS native Metal shaders | M1/M2/M3/M4, unified memory
CUDA (NVIDIA) | 4-bit NF4 via bitsandbytes | INT4 | Fused Triton kernels | Requires bitsandbytes
MLX (Apple Silicon) | mlx-lm native | fp16/bf16 | MLX graph compilation | Requires mlx, mlx-lm
CPU | fp32 direct | float32 | NEON vectorized (ARM) | Testing or lightweight models

Quick Start

Transformers Backend

from gearbx import GearbxModel

# Device auto-detected: MPS > CUDA > CPU
gbm = GearbxModel(
    'mistralai/Mistral-7B-Instruct-v0.3',
    num_attn_layers_to_manage=16,
    high_thresh=3.5,
    low_thresh=1.8,
)

r = gbm.generate('What is the capital of Japan?', max_new_tokens=30)
print(r.text)
print(r.gear_stats)  # e.g. {'low': 0.80, 'mid': 0.20}

MLX Backend (Apple Silicon)

from gearbx import MLXGearbxModel

gbm = MLXGearbxModel('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
r = gbm.generate('Explain entropy in information theory.', max_new_tokens=200)
print(r.text)
print(r.gear_stats)
print(f'{r.tokens_per_sec:.1f} tok/s')

Ollama Backend

from gearbx import OllamaGearbx

gbm = OllamaGearbx(
    gear_models={
        'low':  'mistral:7b',
        'mid':  'mistral:7b',
        'high': 'mistral:7b',
    },
    low_thresh=0.4,
    high_thresh=1.2,
)

# Streaming with per-token telemetry
for chunk in gbm.generate_stream('Prove sqrt(2) is irrational.', max_new_tokens=200):
    print(chunk.token, end='', flush=True)

GGUF via Ollama Cache

# Load GGUF from Ollama's local blob cache (no HF download)
python3 run_gguf.py llama3.2:1b

Installation

Apple Silicon (MPS)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python3 -c "import torch; print('MPS:', torch.backends.mps.is_available())"

Apple Silicon (MLX)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[mlx,dev]"

NVIDIA (CUDA)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[cuda,dev]"

python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"

TUI (Terminal UI)

# From source
cd tui && make build

# Via npm (downloads prebuilt binary)
npm install -g gearbx
gearbx

Requirements

  • Python >= 3.10
  • PyTorch >= 2.1.0
  • transformers >= 4.40.0
  • accelerate >= 0.28.0
  • bitsandbytes >= 0.43.0 (CUDA only, optional)
  • mlx >= 0.12.0, mlx-lm >= 0.12.0 (MLX backend, optional)

Hardware Requirements

Target Model | Memory Required | Apple Silicon | NVIDIA GPU
Phi-3 Mini 3.8B | 4-8 GB | Any M-series | RTX 3060
Mistral 7B | 8-14 GB | M1 Pro+ / M2+ | RTX 3070
LLaMA-3 8B | 10-16 GB | M1 Max+ / M2 Pro+ | RTX 3080
LLaMA-3 8B (full dual) | 16-20 GB | M2 Max+ / M3 Pro+ | RTX 4070

Running Tests

# All unit tests (no GPU needed)
pytest tests/ -v

# Individual test files
pytest tests/test_monitor.py -v
pytest tests/test_precision.py -v
pytest tests/test_integration.py -v

Benchmarks

All benchmarks support --unit mode (synthetic data, CPU) and full mode (real model, MPS/CUDA).

# Unit mode, no GPU needed
python3 benchmarks/bench_entropy.py --unit
python3 benchmarks/bench_latency.py --unit
python3 benchmarks/bench_quality.py --unit

# Full mode, auto-detects MPS or CUDA
python3 benchmarks/bench_entropy.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_latency.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_quality.py --model mistralai/Mistral-7B-Instruct-v0.3 --n 200

# Production benchmark
python3 benchmarks/bench_production.py

Benchmark Targets

Benchmark | Metric | Target
Entropy Signal | Separation ratio (hard/trivial) | > 2.0x
Gear Shift Latency | Time per shift() call | < 0.5 ms
Monitor Throughput | Time per update() call | < 1.0 ms (GPU/MPS)
Hook Overhead | Added latency per forward pass | < 0.5 ms (GPU/MPS)
GSM8K Accuracy | vs. static 8-bit baseline | within 2pp

Threshold Tuning

Thresholds auto-scale by log2(vocab_size) / log2(32768) when vocab_size is provided (GearbxModel does this automatically).
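The scaling rule can be written out directly. The function name and constant below are hypothetical; only the formula log2(vocab_size) / log2(32768) comes from the text above.

```python
import math

REFERENCE_VOCAB = 32768  # reference vocabulary size stated above


def scale_thresholds(low_thresh, high_thresh, vocab_size):
    """Scale reference thresholds by log2(vocab_size) / log2(32768)."""
    factor = math.log2(vocab_size) / math.log2(REFERENCE_VOCAB)
    return low_thresh * factor, high_thresh * factor
```

At the reference vocabulary the factor is exactly 1, so thresholds pass through unchanged. For a 128,256-token vocabulary the factor is roughly 1.13.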

Auto-calibration: calibrate_from_logits() computes entropy from a prefill pass and sets thresholds at p40/p60 of the distribution. Post-calibration caps prevent inflated thresholds: high ≤ log2(vocab) * 0.18, low ≤ log2(vocab) * 0.06.
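A minimal sketch of this calibration step, working from a list of per-position entropies that have already been computed. The names calibrate_thresholds and percentile are hypothetical, and the linear-interpolation percentile is an assumption; only the p40/p60 targets and the 0.06 / 0.18 caps come from the description above.

```python
import math


def percentile(sorted_values, pct):
    """Linear-interpolated percentile of an ascending list (pct in [0, 100])."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def calibrate_thresholds(entropies, vocab_size):
    """Set low/high at p40/p60 of observed entropy, then apply the caps."""
    values = sorted(entropies)
    max_bits = math.log2(vocab_size)
    low = min(percentile(values, 40), max_bits * 0.06)
    high = min(percentile(values, 60), max_bits * 0.18)
    return low, high
```

For a 32,768-token vocabulary the caps work out to about 0.9 bits for low and 2.7 bits for high, so a prefill dominated by high-entropy positions cannot push the thresholds above those values.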

Suggested Thresholds by Model

Model | Vocab Size | low_thresh | high_thresh
LLaMA-3 8B / 70B | 128,256 | 2.2 | 4.5
Mistral 7B v0.3 | 32,768 | 1.8 | 3.5
Qwen2 7B | 151,936 | 2.5 | 5.0
Phi-3 Mini | 32,064 | 1.7 | 3.4
Gemma 2 9B | 256,000 | 2.8 | 5.5

For the Ollama backend, top-k truncated entropy has a narrower range, so use lower thresholds (e.g., low=0.4, high=1.2).

Known Limitations

  1. Single-Stream Only: Designed for single-sequence local inference, not batched serving. generate_batch() runs prompts sequentially with per-sequence gear routing.

  2. KV Cache Precision Mismatch: Mid-sequence gear shifts create mixed-precision KV cache entries. Use min_gear_duration=4 to reduce noise, or use_cache=False during strict validation.

  3. MPS fp16 Base: On Apple Silicon (transformers backend), the base model runs at fp16 since bitsandbytes doesn't support MPS. Memory usage is higher than CUDA 4-bit but entropy routing and quality benefits still apply.

  4. Double Quantization: Loading GGUF→fp16→int4 compounds quantization error. PrecisionManager warns when detected. Consider higher bit-width for already-quantized source models.

  5. Ollama Coarse Routing: Ollama backend routes per-prompt (not per-token) since the server is a black box. Entropy telemetry is still per-token.

  6. Catastrophic Entropy: If entropy exceeds 90% of theoretical max for 2+ consecutive tokens, generation auto-falls-back to mid gear. Catches model failures from bad quantization.
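The catastrophic-entropy check in item 6 amounts to a small streak counter. The sketch below is an assumed shape for it, not the project's implementation: the class name is hypothetical, and it takes the theoretical maximum to be log2(vocab_size) bits.

```python
import math


class CatastrophicEntropyGuard:
    """Hypothetical guard: flags entropy above a fraction of the maximum for N steps in a row."""

    def __init__(self, vocab_size, fraction=0.9, patience=2):
        self.limit = fraction * math.log2(vocab_size)  # 90% of the theoretical max
        self.patience = patience  # consecutive tokens required before falling back
        self.streak = 0

    def update(self, entropy_bits):
        """Return True when the fallback should trigger for this token."""
        if entropy_bits > self.limit:
            self.streak += 1
        else:
            self.streak = 0  # any normal token resets the count
        return self.streak >= self.patience
```

For a 32,768-token vocabulary the limit is 13.5 bits, so two consecutive tokens above that value would trigger the fallback, while a single spike followed by a normal token would not.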

License

Proprietary - Jpdz Labs. All rights reserved.


Source Distribution

gearbx-1.0.2.tar.gz (54.2 kB)

Uploaded Source

File details

Details for the file gearbx-1.0.2.tar.gz.

File metadata

  • Download URL: gearbx-1.0.2.tar.gz
  • Upload date:
  • Size: 54.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for gearbx-1.0.2.tar.gz
Algorithm | Hash digest
SHA256 | 14a034b5478fc7e485383bc0374e90ab6a4b58d17fb714b8b0dd291adf2749ec
MD5 | 6f7484d5e0903f7387443df1c88b22eb
BLAKE2b-256 | e78ded8d03ddda869ef80d93506e82fd34c9cc3b985dcb333a9b855dff93b280
