Complexity + VRAM-aware routing for local dual-tier LLM deployments
gemma4-adaptive-router
The only public implementation of sub-millisecond complexity scoring coupled to real-time VRAM scheduling for consumer GPU on-prem inference.
Install
pip install gemma4-adaptive-router
Quickstart
from adaptive_router import AdaptiveRouter, RoutingConfig
import time
config = RoutingConfig(
    complexity_threshold=0.65,  # score >= this → tier_high
    vram_headroom_gb=1.5,       # free VRAM below this → force tier_low
    latency_sla_ms=2000.0,      # EMA latency above this → force tier_low
    sla_warmup_seed_ms=800.0,   # seed EMA to avoid cold-start burst on tier_high
)
router = AdaptiveRouter(config)
# Route a query
tier = router.route("Explain the proof of Fermat's Last Theorem step by step")
# → "tier_high" (complex math query, VRAM available)
# After you get the response, report the actual latency so the SLA rule adapts
t0 = time.monotonic()
# ... call your LLM endpoint ...
router.observe(tier, latency_ms=(time.monotonic() - t0) * 1000)
# Clean up background VRAM monitor thread
router.shutdown()
Architecture
┌─────────────────────────────────────────────────────────────┐
│ adaptive_router/ │
│ │
│ Layer 1: ComplexityScorer │
│ ├── Rule-based, 6 dimensions, sub-ms, zero external calls │
│ ├── math (0.25) · code (0.25) · depth (0.20) │
│ ├── tokens (0.15) · entities (0.10) · negation (0.05) │
│ └── Output: score in [0.0, 1.0] │
│ │
│ Layer 2: VRAMMonitor (daemon thread) │
│ ├── pynvml direct — no subprocess nvidia-smi overhead │
│ ├── Polling at 10-20ms with atomic shared state │
│ └── Thread-safe VRAMState (free_gb, used_gb, util_pct) │
│ │
│ Layer 3: RoutingDecision (chain of rules) │
│ ├── complexity_rule: score < threshold → tier_low │
│ ├── vram_rule: free_gb < headroom → tier_low │
│ ├── sla_rule: EMA latency > SLA → tier_low │
│ └── Default fallthrough: tier_high │
└─────────────────────────────────────────────────────────────┘
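The six Layer 1 weights in the diagram sum to 1.0, which suggests the final score is a weighted sum of per-dimension sub-scores. A minimal sketch of that combination step follows. `combine_scores` and the clamping behaviour are assumptions made for illustration, not the library's actual `ComplexityScorer`, and how each sub-score is computed is not covered here.

```python
# Weights copied from the architecture diagram; they sum to 1.0.
WEIGHTS = {
    "math": 0.25,
    "code": 0.25,
    "depth": 0.20,
    "tokens": 0.15,
    "entities": 0.10,
    "negation": 0.05,
}


def combine_scores(sub_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, clamped to [0.0, 1.0].

    Hypothetical helper: a missing dimension counts as 0.0, and each
    sub-score is clamped to [0.0, 1.0] before it is weighted.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, sub_scores.get(name, 0.0)))
        total += weight * value
    return min(1.0, max(0.0, total))
```

Under these assumptions, a query that maxes out only the math and depth dimensions scores 0.25 + 0.20 = 0.45, which is below the default `complexity_threshold` of 0.65 and would therefore route to tier_low.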
Configuration reference
| Field | Type | Default | Effect |
|---|---|---|---|
| complexity_threshold | float | 0.65 | Score ≥ this routes to tier_high |
| vram_headroom_gb | float | 1.5 | Free VRAM below this forces tier_low |
| latency_sla_ms | float | 2000.0 | EMA latency above this forces tier_low |
| vram_poll_interval_ms | int | 15 | VRAM polling interval (ms) |
| sla_warmup_seed_ms | float | 0.0 | EMA seed at startup; set to the p50 latency of tier_high to avoid cold-start burst |
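To illustrate how `vram_poll_interval_ms` and `vram_headroom_gb` could interact, here is a minimal sketch of a background poller that keeps the latest free-VRAM reading behind a lock. `VRAMPollerSketch` and its injected `read_free_gb` callable are hypothetical and are not the library's `VRAMMonitor`. On a real GPU the callable could wrap `pynvml.nvmlDeviceGetMemoryInfo(handle).free`, after `pynvml.nvmlInit()` and `pynvml.nvmlDeviceGetHandleByIndex(0)`.

```python
import threading
from typing import Callable


class VRAMPollerSketch:
    """Hypothetical poller: samples free VRAM on a daemon thread."""

    def __init__(self, read_free_gb: Callable[[], float], poll_interval_ms: int = 15):
        self._read = read_free_gb
        self._interval_s = poll_interval_ms / 1000.0
        self._lock = threading.Lock()
        self._free_gb = read_free_gb()  # take one sample up front
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep between samples.
        while not self._stop.wait(self._interval_s):
            value = self._read()
            with self._lock:
                self._free_gb = value

    def free_gb(self) -> float:
        with self._lock:
            return self._free_gb

    def below_headroom(self, headroom_gb: float) -> bool:
        """True when the latest sample would force tier_low."""
        return self.free_gb() < headroom_gb

    def shutdown(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()
```

Injecting the reader keeps the sketch runnable without a GPU, in the same spirit as mocking pynvml in tests.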
Load from YAML:
from adaptive_router.config import load_config
config = load_config("router_config.yaml")
# router_config.yaml
complexity_threshold: 0.65
vram_headroom_gb: 1.5
latency_sla_ms: 2000.0
vram_poll_interval_ms: 15
sla_warmup_seed_ms: 800.0
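The YAML keys map one-to-one onto the fields in the configuration reference. The sketch below shows one way such a parsed mapping could be validated before use. `RoutingConfigSketch` and `from_mapping` are hypothetical names, and the validation rules are assumptions; this is not the library's `RoutingConfig` or `load_config`. It takes a plain dict, so it does not depend on a YAML parser.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class RoutingConfigSketch:
    """Hypothetical mirror of the documented fields and defaults."""

    complexity_threshold: float = 0.65
    vram_headroom_gb: float = 1.5
    latency_sla_ms: float = 2000.0
    vram_poll_interval_ms: int = 15
    sla_warmup_seed_ms: float = 0.0

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real library may validate differently.
        if not 0.0 <= self.complexity_threshold <= 1.0:
            raise ValueError("complexity_threshold must be in [0.0, 1.0]")
        if self.vram_headroom_gb < 0:
            raise ValueError("vram_headroom_gb must be non-negative")
        if self.latency_sla_ms <= 0:
            raise ValueError("latency_sla_ms must be positive")
        if self.vram_poll_interval_ms <= 0:
            raise ValueError("vram_poll_interval_ms must be positive")
        if self.sla_warmup_seed_ms < 0:
            raise ValueError("sla_warmup_seed_ms must be non-negative")


def from_mapping(data: Mapping[str, Any]) -> RoutingConfigSketch:
    """Build a config from an already-parsed mapping, rejecting unknown keys."""
    known = {f.name for f in fields(RoutingConfigSketch)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return RoutingConfigSketch(**data)
```

Rejecting unknown keys catches typos such as `latency_sla` early, instead of silently falling back to a default.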
Deploy as FastAPI proxy
python -m adaptive_router.middleware \
--tier-high-url http://llama-cpp:8080/v1 \
--tier-low-url http://vllm:8000/v1 \
--port 9000
Exposes /v1/chat/completions (proxied), /health, and /metrics (Prometheus).
Custom routing rules
from adaptive_router import AdaptiveRouter, RoutingConfig, RouterState
def my_rule(query: str, state: RouterState):
    # Force tier_high for any query mentioning "contract"
    if "contract" in query.lower():
        return "tier_high"
    return None  # abstain, let next rule decide
router = AdaptiveRouter(config, rules=[my_rule])
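Because a rule either returns a tier or abstains with `None`, the chain from the architecture diagram can be pictured as a first-match loop with a tier_high fallthrough. The sketch below is an illustration only. The `State` dict, the `decide` function, and the choice to run the custom rule before the built-in rules are all assumptions, not the library's documented behaviour.

```python
from typing import Callable, Optional, Sequence

State = dict  # hypothetical stand-in for RouterState
Rule = Callable[[str, State], Optional[str]]


def complexity_rule(query: str, state: State) -> Optional[str]:
    return "tier_low" if state["score"] < state["complexity_threshold"] else None


def vram_rule(query: str, state: State) -> Optional[str]:
    return "tier_low" if state["free_gb"] < state["vram_headroom_gb"] else None


def sla_rule(query: str, state: State) -> Optional[str]:
    return "tier_low" if state["ema_latency_ms"] > state["latency_sla_ms"] else None


def contract_rule(query: str, state: State) -> Optional[str]:
    # Same logic as my_rule above, against the stand-in state type.
    return "tier_high" if "contract" in query.lower() else None


def decide(query: str, state: State, rules: Sequence[Rule],
           default: str = "tier_high") -> str:
    """Return the first non-None verdict, else the default fallthrough."""
    for rule in rules:
        verdict = rule(query, state)
        if verdict is not None:
            return verdict
    return default
```

With this ordering, a custom rule placed first can override the complexity rule, which is what lets a short query mentioning "contract" still reach tier_high.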
Known limitations
- A 16 GB GPU cannot hold the 26B model in vLLM natively. The router mitigates this by routing complex queries to llama.cpp. The real fix for 200 concurrent users who need 26B is a dual-GPU setup.
- Blackwell SM120 rough edges. FP8 KV and MTP require specific workarounds. See VRAM-REALITY-CHECK.md.
- EXL2 is not designed for multi-user production. TabbyAPI maintainers state this explicitly. It's in benchmarks for completeness only.
- SGLang does not always win. RadixAttention benefits depend on prefix overlap. See RADIXATTENTION-OPERATIVE-TABLE.md.
- MTP on SM120 is not plug-and-play. It works with BF16 plus workarounds; with NVFP4 it produces garbage.
- Cold-start burst on tier_high. With sla_warmup_seed_ms=0.0, the EMA starts at 0 ms, below any SLA. The first ~15-20 complex queries all hit tier_high before the EMA reflects real latency. Set sla_warmup_seed_ms to your expected p50.
- This is v0.1. Functional and tested, but not battle-tested at thousands of users. It's the starting point.
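The cold-start item above can be made concrete with a small simulation. The sketch assumes a standard exponential moving average, `ema = alpha * sample + (1 - alpha) * ema`, with a smoothing factor of 0.1. Both the formula and the factor are assumptions, since the library's actual EMA parameters are not documented here, and `requests_before_sla_trips` is a hypothetical helper.

```python
def requests_before_sla_trips(seed_ms: float, real_latency_ms: float,
                              sla_ms: float, alpha: float = 0.1,
                              max_steps: int = 10_000) -> int:
    """Count tier_high requests served before the EMA exceeds the SLA.

    Assumes every request takes real_latency_ms and that the SLA rule
    fires as soon as the EMA is strictly above sla_ms.
    """
    ema = seed_ms
    served = 0
    while ema <= sla_ms and served < max_steps:
        ema = alpha * real_latency_ms + (1 - alpha) * ema  # one observation
        served += 1
    return served


# Illustrative numbers only: tier_high really takes 2500 ms, SLA is 2000 ms.
cold = requests_before_sla_trips(seed_ms=0.0, real_latency_ms=2500.0, sla_ms=2000.0)
seeded = requests_before_sla_trips(seed_ms=800.0, real_latency_ms=2500.0, sla_ms=2000.0)
```

Under these assumed numbers, `cold` is 16 and `seeded` is 12, so a higher seed shortens the burst, and a seed already above the SLA removes it entirely. The exact counts depend on the smoothing factor and on how far real latency sits above the SLA.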
Running tests
pip install -e ".[dev]"
pytest tests/ -v # No GPU required — pynvml is mocked in all fixtures