Entropy-Routed Dynamic Quantization for LLM Inference

Gearbx: Entropy-Routed Dynamic Quantization

An entropy-routed dynamic quantization engine for LLM inference that adjusts weight precision on a per-token basis during generation.

Website: gearbx.jpdz.app

How It Works

Not all tokens require the same computational fidelity. Producing "Hello, how can I help you?" demands almost no semantic reasoning, while the next token in a partial differential equation imposes high cognitive load on the model.

Gearbx monitors the model's own output entropy at each generation step and routes subsequent forward passes through pre-cached layer weights at the appropriate precision tier:

Gear	Precision	Memory per Param	When
`low`	4-bit packed	0.5 bytes (25% of fp16)	Filler tokens, greetings, boilerplate
`mid`	8-bit packed	1 byte (50% of fp16)	Standard reasoning, moderate difficulty
`high`	fp16	2 bytes (original)	Complex reasoning, math, proofs
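The 0.5 bytes per parameter in the `low` row follows from storing two 4-bit values in each byte. The following is a minimal pure-Python sketch of that packing, not the code in kernels.py; the names `pack_int4` and `unpack_int4` and the low-nibble-first layout are assumptions made for illustration.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count so every byte is full
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("int4 values must lie in [-8, 7]")
        # Mask to 4 bits (two's complement) and place hi in the upper nibble.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


weights = [-8, 7, 3, -1, 0, 5]
packed = pack_int4(weights)
assert unpack_int4(packed, len(weights)) == weights
print(len(packed) / len(weights))  # 0.5 bytes per value
```

A real kernel would also store per-channel or per-group scales, which add a small overhead on top of the 0.5 bytes shown here.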

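Autoregressive LLM inference is typically memory-bandwidth bound: generating each token requires reading every weight matrix from memory. Packed storage means fewer bytes transferred per token, so in the bandwidth-bound regime the speedup on quantized layers can approach the compression ratio (up to 2x for 8-bit and 4x for 4-bit, relative to fp16).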
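As a back-of-the-envelope illustration of the bandwidth argument, the sketch below estimates weight traffic per token and the resulting upper bound on decode speed. The 7-billion parameter count and the 200 GB/s memory bandwidth are assumed example figures, not measurements, and the estimate ignores quantization scales, the KV cache, activations, and the fact that only the managed layers change precision.

```python
# Bytes per parameter for each gear, taken from the gear table.
BYTES_PER_PARAM = {"low": 0.5, "mid": 1.0, "high": 2.0}


def weight_traffic_gb(num_params, gear):
    """Gigabytes of weights read per generated token if every layer uses `gear`."""
    return num_params * BYTES_PER_PARAM[gear] / 1e9


def max_tokens_per_sec(num_params, gear, bandwidth_gb_s):
    """Upper bound on decode speed when weight reads are the only cost."""
    return bandwidth_gb_s / weight_traffic_gb(num_params, gear)


NUM_PARAMS = 7e9      # assumed model size
BANDWIDTH = 200.0     # assumed memory bandwidth in GB/s

for gear in ("high", "mid", "low"):
    gb = weight_traffic_gb(NUM_PARAMS, gear)
    tps = max_tokens_per_sec(NUM_PARAMS, gear, BANDWIDTH)
    print(f"{gear:>4}: {gb:4.1f} GB/token, at most {tps:4.1f} tok/s")
```

Under these assumptions the bound scales inversely with bytes per parameter, which is the sense in which speed gains track the compression ratio.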

Architecture

Prompt → [PromptClassifier] → initial_gear
                │
    ┌───────────▼──────────────────────────┐
    │       GENERATION LOOP                │
    │                                      │
    │  input_ids → model.forward() → logits│
    │                    │                 │
    │          [EntropyMonitor]            │
    │      Shannon entropy → gear decision │
    │           (rolling avg + hysteresis) │
    │                    │                 │
    │         [PrecisionManager]           │
    │     lazy quantize + module swap      │
    │     (originals offloaded to CPU)     │
    │                    │                 │
    │          sample next_token           │
    └──────────────────────────────────────┘
                │
    generated_ids → decode → output_text

Core Components

EntropyMonitor (monitor.py): Computes Shannon entropy (bits, log base 2) from logit distributions. A rolling-window average smooths gear transitions. Supports vocab-size auto-scaling (reference: 32768 tokens), hysteresis to prevent oscillation near boundaries, and auto-calibration from the observed entropy distribution (p40/p60 thresholds). Deferred pipeline mode eliminates the per-token GPU→CPU sync.
PrecisionManager (precision.py): Lazy quantization with direct module-reference swaps. Discovers attention layers at init via detect_attn_prefixes() but defers quantization to the first shift() call. On shift: quantizes from the originals, swaps the quantized module into the model tree, and offloads the originals to CPU. Only one quantized variant exists at any time, so memory use goes down, not up.
QuantizedLinear Kernels (kernels.py): Real packed integer storage: int8 (per-channel symmetric), int6, int4 (2 values per byte), int2 (4 values per byte, ternary {-1,0,1}). Dual-path forward: fused Triton kernels on CUDA (in-register dequant, no intermediate tensor) and a dequant-cache path on MPS/CPU (the first forward unpacks to fp16, subsequent forwards reuse the cached tensor).
Native Acceleration (native_kernels.py + csrc/): Provides three tiers below Triton: MPS native Metal shaders via PyTorch's MetalShaderLibrary (zero-copy, threadgroup 256), legacy Metal command buffers, and CPU NEON vectorized matmul (ARM -O3 -march=armv8.2-a+fp16).
Fused Triton Kernels (triton_kernels.py): CUDA-only fused quantized matmul. Multiplies activations directly with packed int8/int4/int2 weights without materializing fp16. Dequantization happens in registers, so only packed data traverses global memory.
GearbxModel (model.py): Orchestrates everything. Loads the model via transformers, auto-detects the attention architecture, and passes vocab_size to EntropyMonitor for threshold scaling. Runs a manual autoregressive loop with KV caching and NaN guards for MPS stability.
StatisticalPromptClassifier (classifier.py): Cold-start heuristic. Scores prompt complexity from math symbols, code keywords, length, and average word length, and picks the initial gear before the entropy window fills.
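The entropy-to-gear decision described for EntropyMonitor can be sketched in a few lines of standard-library Python. This is an illustrative stand-in, not the code in monitor.py: the class name `RollingGearSelector`, the `window` and `margin` parameters, and the exact hysteresis rule (a threshold must be cleared by `margin` before the gear changes) are all assumptions.

```python
import math
from collections import deque


def shannon_entropy_bits(logits):
    """Shannon entropy, in bits, of softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


class RollingGearSelector:
    """Hypothetical gear selector: rolling mean of entropy plus hysteresis."""

    def __init__(self, low_thresh=1.8, high_thresh=3.5, window=4,
                 margin=0.2, gear="mid"):
        self.low, self.high, self.margin = low_thresh, high_thresh, margin
        self.history = deque(maxlen=window)
        self.gear = gear

    def update(self, logits):
        self.history.append(shannon_entropy_bits(logits))
        avg = sum(self.history) / len(self.history)
        m = self.margin
        # The average must clear a threshold by `m` before the gear changes,
        # so values hovering near a boundary do not cause oscillation.
        if self.gear == "low":
            if avg > self.high + m:
                self.gear = "high"
            elif avg > self.low + m:
                self.gear = "mid"
        elif self.gear == "mid":
            if avg > self.high + m:
                self.gear = "high"
            elif avg < self.low - m:
                self.gear = "low"
        else:  # currently "high"
            if avg < self.low - m:
                self.gear = "low"
            elif avg < self.high - m:
                self.gear = "mid"
        return self.gear


selector = RollingGearSelector(window=1)
print(selector.update([0.0] * 16))           # uniform over 16 tokens: 4.0 bits
print(selector.update([20.0] + [0.0] * 15))  # sharply peaked: close to 0 bits
```

With `window=1` the two calls print `high` and then `low`; a larger window delays the shift until the rolling average follows.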

Backends

Backend	Module	Routing	Weight Access	Install
Transformers	`GearbxModel`	Per-token	Direct `nn.Linear` swap	`pip install -e .`
MLX	`MLXGearbxModel`	Per-token	`mlx.nn.QuantizedLinear` swap	`pip install -e ".[mlx]"`
Ollama	`OllamaGearbx`	Per-prompt	Black-box HTTP (any OpenAI-compat server)	`pip install -e .`

Transformers backend: Full per-token gear shifting with real weight swaps. Supports MPS, CUDA, and CPU.

MLX backend: Native Apple Silicon via mlx-lm. Unified memory means no CPU offloading. Supports telemetry-only mode for pre-quantized models. Median-based calibration.

Ollama backend: Per-prompt routing via prompt classifier. Talks to Ollama, llama.cpp, vLLM, LM Studio, or any OpenAI-compatible local server over HTTP. Real per-token entropy telemetry via top_logprobs. Supports SSE streaming with StreamChunk per-token telemetry.
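Per-prompt routing depends on scoring the prompt before any entropy is available. The sketch below shows one way such a cold-start score could combine the signals listed for StatisticalPromptClassifier (math symbols, code keywords, length, average word length). The symbol set, keyword list, weights, and cut-offs are hypothetical values chosen for illustration and are not taken from classifier.py.

```python
import re

# Hypothetical feature lists; the real classifier may use different ones.
MATH_SYMBOLS = set("=+*/^<>∑∫√∂≤≥")
CODE_KEYWORDS = {"def", "class", "import", "return", "lambda", "struct", "async"}


def complexity_score(prompt):
    """Heuristic complexity score; higher means a harder-looking prompt."""
    words = re.findall(r"[A-Za-z_]+", prompt)
    math_hits = sum(ch in MATH_SYMBOLS for ch in prompt)
    code_hits = sum(w.lower() in CODE_KEYWORDS for w in words)
    avg_word_len = sum(map(len, words)) / len(words) if words else 0.0

    score = 0.1 * min(math_hits, 5)              # math symbols, capped
    score += 0.1 * min(code_hits, 5)             # code keywords, capped
    score += 0.2 * min(len(words) / 200.0, 1.0)  # prompt length
    score += 0.2 if avg_word_len > 6.0 else 0.0  # long words
    return score


def initial_gear(prompt, low_cut=0.2, high_cut=0.5):
    """Map the score to a starting gear using assumed cut-offs."""
    score = complexity_score(prompt)
    if score < low_cut:
        return "low"
    return "mid" if score < high_cut else "high"


for prompt in ("Hello, how are you?",
               "import numpy and return the class",
               "Solve x^2 + 3*x = 4 and d/dx(x^3)"):
    print(f"{initial_gear(prompt):>4}  {prompt}")
```

Under these assumed weights the three prompts map to `low`, `mid`, and `high` respectively.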

Device Support

Device	Loading Strategy	Base Precision	Acceleration	Notes
MPS (Apple Silicon)	fp16 direct	float16	MPS native Metal shaders	M1/M2/M3/M4, unified memory
CUDA (NVIDIA)	4-bit NF4 via bitsandbytes	INT4	Fused Triton kernels	Requires bitsandbytes
MLX (Apple Silicon)	mlx-lm native	fp16/bf16	MLX graph compilation	Requires mlx, mlx-lm
CPU	fp32 direct	float32	NEON vectorized (ARM)	Testing or lightweight models

Quick Start

Transformers Backend

from gearbx import GearbxModel

# Device auto-detected: MPS > CUDA > CPU
gbm = GearbxModel(
    'mistralai/Mistral-7B-Instruct-v0.3',
    num_attn_layers_to_manage=16,
    high_thresh=3.5,
    low_thresh=1.8,
)

r = gbm.generate('What is the capital of Japan?', max_new_tokens=30)
print(r.text)
print(r.gear_stats)  # e.g. {'low': 0.80, 'mid': 0.20}

MLX Backend (Apple Silicon)

from gearbx import MLXGearbxModel

gbm = MLXGearbxModel('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
r = gbm.generate('Explain entropy in information theory.', max_new_tokens=200)
print(r.text)
print(r.gear_stats)
print(f'{r.tokens_per_sec:.1f} tok/s')

Ollama Backend

from gearbx import OllamaGearbx

gbm = OllamaGearbx(
    gear_models={
        'low':  'mistral:7b',
        'mid':  'mistral:7b',
        'high': 'mistral:7b',
    },
    low_thresh=0.4,
    high_thresh=1.2,
)

# Streaming with per-token telemetry
for chunk in gbm.generate_stream('Prove sqrt(2) is irrational.', max_new_tokens=200):
    print(chunk.token, end='', flush=True)

GGUF via Ollama Cache

# Load GGUF from Ollama's local blob cache (no HF download)
python3 run_gguf.py llama3.2:1b

Installation

Apple Silicon (MPS)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python3 -c "import torch; print('MPS:', torch.backends.mps.is_available())"

Apple Silicon (MLX)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[mlx,dev]"

NVIDIA (CUDA)

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[cuda,dev]"

python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"

TUI (Terminal UI)

# From source
cd tui && make build

# Via npm (downloads prebuilt binary)
npm install -g gearbx
gearbx

Requirements

Python >= 3.10
PyTorch >= 2.1.0
transformers >= 4.40.0
accelerate >= 0.28.0
bitsandbytes >= 0.43.0 (CUDA only, optional)
mlx >= 0.12.0, mlx-lm >= 0.12.0 (MLX backend, optional)

Hardware Requirements

Target Model	Memory Required	Apple Silicon	NVIDIA GPU
Phi-3 Mini 3.8B	4-8 GB	Any M-series	RTX 3060
Mistral 7B	8-14 GB	M1 Pro+ / M2+	RTX 3070
LLaMA-3 8B	10-16 GB	M1 Max+ / M2 Pro+	RTX 3080
LLaMA-3 8B (full dual)	16-20 GB	M2 Max+ / M3 Pro+	RTX 4070

Running Tests

# All unit tests (no GPU needed)
pytest tests/ -v

# Individual test files
pytest tests/test_monitor.py -v
pytest tests/test_precision.py -v
pytest tests/test_integration.py -v

Benchmarks

All benchmarks support --unit mode (synthetic data, CPU) and full mode (real model, MPS/CUDA).

# Unit mode, no GPU needed
python3 benchmarks/bench_entropy.py --unit
python3 benchmarks/bench_latency.py --unit
python3 benchmarks/bench_quality.py --unit

# Full mode, auto-detects MPS or CUDA
python3 benchmarks/bench_entropy.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_latency.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_quality.py --model mistralai/Mistral-7B-Instruct-v0.3 --n 200

# Production benchmark
python3 benchmarks/bench_production.py

Benchmark Targets

Benchmark	Metric	Target
Entropy Signal	Separation ratio (hard/trivial)	> 2.0x
Gear Shift Latency	Time per shift() call	< 0.5 ms
Monitor Throughput	Time per update() call	< 1.0 ms (GPU/MPS)
Hook Overhead	Added latency per forward pass	< 0.5 ms (GPU/MPS)
GSM8K Accuracy	vs. static 8-bit baseline	within 2pp

Threshold Tuning

Thresholds auto-scale by log2(vocab_size) / log2(32768) when vocab_size is provided (GearbxModel does this automatically).

Auto-calibration: calibrate_from_logits() computes entropy from a prefill pass and sets thresholds at p40/p60 of the distribution. Post-calibration caps prevent inflated thresholds: high ≤ log2(vocab) * 0.18, low ≤ log2(vocab) * 0.06.
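The scaling rule and the capped percentile calibration can be written out as follows. This is a sketch under stated assumptions rather than the body of calibrate_from_logits(): the function names are illustrative, the input is a list of already-computed entropies, and the percentile uses linear interpolation, which the description above does not specify.

```python
import math

REFERENCE_VOCAB = 32768  # reference vocabulary size used for scaling


def scale_thresholds(low, high, vocab_size):
    """Scale thresholds by log2(vocab_size) / log2(32768)."""
    factor = math.log2(vocab_size) / math.log2(REFERENCE_VOCAB)
    return low * factor, high * factor


def percentile(sorted_values, q):
    """Linearly interpolated percentile of a sorted list, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def calibrate(entropies, vocab_size):
    """Set low/high at p40/p60 of observed entropies, then apply the caps."""
    xs = sorted(entropies)
    max_bits = math.log2(vocab_size)
    low = min(percentile(xs, 40), max_bits * 0.06)
    high = min(percentile(xs, 60), max_bits * 0.18)
    return low, high


print(scale_thresholds(1.8, 3.5, 32768))   # reference vocab: unchanged
print(scale_thresholds(1.8, 3.5, 128256))  # larger vocab: scaled up
print(calibrate([0.1 * i for i in range(11)], 32768))
```

For a 32768-token vocabulary the caps work out to 0.9 bits for `low` and 2.7 bits for `high`, so a prefill pass with uniformly high entropy cannot push the thresholds past those values.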

Suggested Thresholds by Model

Model	Vocab Size	low_thresh	high_thresh
LLaMA-3 8B / 70B	128,256	2.2	4.5
Mistral 7B v0.3	32,768	1.8	3.5
Qwen2 7B	151,936	2.5	5.0
Phi-3 Mini	32,064	1.7	3.4
Gemma 2 9B	256,000	2.8	5.5

For the Ollama backend, top-k truncated entropy has a narrower range, so use lower thresholds (e.g., low=0.4, high=1.2).
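The narrower range follows from the truncation itself: entropy over k candidates cannot exceed log2(k) bits. The sketch below computes entropy from a list of top-k log-probabilities (natural log, as OpenAI-compatible servers report them). Renormalizing over the top-k candidates is an assumption made here; the backend may handle the missing tail probability differently.

```python
import math


def topk_entropy_bits(top_logprobs):
    """Entropy in bits over the top-k candidates after renormalizing."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]  # assumed: renormalize over top-k
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


k = 5
uniform_topk = [math.log(1.0 / k)] * k
print(topk_entropy_bits(uniform_topk))  # upper bound for k = 5
print(math.log2(k))                     # same value: log2(5), about 2.32 bits
print(math.log2(32768))                 # full-vocabulary upper bound: 15.0 bits
```

With five candidates the ceiling is about 2.32 bits instead of 15 bits for a 32768-token vocabulary, which is why thresholds tuned for full-vocabulary entropy are too high for this backend.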

Known Limitations

Single-Stream Only: Designed for single-sequence local inference, not batched serving. generate_batch() runs prompts sequentially with per-sequence gear routing.
KV Cache Precision Mismatch: Mid-sequence gear shifts create mixed-precision KV cache entries. Use min_gear_duration=4 to reduce noise, or use_cache=False during strict validation.
MPS fp16 Base: On Apple Silicon (transformers backend), the base model runs at fp16 since bitsandbytes doesn't support MPS. Memory usage is higher than CUDA 4-bit but entropy routing and quality benefits still apply.
Double Quantization: Loading GGUF→fp16→int4 compounds quantization error. PrecisionManager warns when detected. Consider higher bit-width for already-quantized source models.
Ollama Coarse Routing: Ollama backend routes per-prompt (not per-token) since the server is a black box. Entropy telemetry is still per-token.
Catastrophic Entropy: If entropy exceeds 90% of theoretical max for 2+ consecutive tokens, generation auto-falls-back to mid gear. Catches model failures from bad quantization.
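The catastrophic-entropy fallback amounts to a streak counter against 90% of the theoretical maximum, log2(vocab_size). The class below is an illustrative sketch; `CatastropheGuard` and its parameter names are hypothetical, and only the 90% ratio and the two-token streak come from the description above.

```python
import math


class CatastropheGuard:
    """Signals a fallback when entropy stays near its theoretical maximum."""

    def __init__(self, vocab_size, ratio=0.9, patience=2):
        self.limit = ratio * math.log2(vocab_size)  # bits
        self.patience = patience
        self.streak = 0

    def check(self, entropy_bits):
        """Return True when generation should fall back to the mid gear."""
        if entropy_bits > self.limit:
            self.streak += 1
        else:
            self.streak = 0  # any normal token resets the streak
        return self.streak >= self.patience


guard = CatastropheGuard(vocab_size=32768)  # limit is 13.5 bits
for entropy in (14.0, 14.2, 3.1):
    print(entropy, guard.check(entropy))
```

The first reading alone does not trigger the fallback, the second consecutive one does, and the third resets the counter.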

License

Release history

1.0.4

May 29, 2026

1.0.3

May 29, 2026

This version

1.0.2

May 29, 2026

1.0.1

May 29, 2026

1.0.0

May 28, 2026

Download files

Source Distribution

gearbx-1.0.2.tar.gz (54.2 kB)

Uploaded May 29, 2026 Source

File details

Details for the file gearbx-1.0.2.tar.gz.

File metadata

Download URL: gearbx-1.0.2.tar.gz
Upload date: May 29, 2026
Size: 54.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for gearbx-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`14a034b5478fc7e485383bc0374e90ab6a4b58d17fb714b8b0dd291adf2749ec`
MD5	`6f7484d5e0903f7387443df1c88b22eb`
BLAKE2b-256	`e78ded8d03ddda869ef80d93506e82fd34c9cc3b985dcb333a9b855dff93b280`
