Complexity + VRAM-aware routing for local dual-tier LLM deployments

These details have not been verified by PyPI

Project links

Project description

gemma4-adaptive-router

Complexity + VRAM-aware routing for local dual-tier LLM deployments.
The only public implementation of sub-millisecond complexity scoring coupled to real-time VRAM scheduling for consumer GPU on-prem inference.

Install

pip install gemma4-adaptive-router

Quickstart

from adaptive_router import AdaptiveRouter, RoutingConfig
import time

config = RoutingConfig(
    complexity_threshold=0.65,   # score >= this → tier_high
    vram_headroom_gb=1.5,        # free VRAM below this → force tier_low
    latency_sla_ms=2000.0,       # EMA latency above this → force tier_low
    sla_warmup_seed_ms=800.0,    # seed EMA to avoid cold-start burst on tier_high
)

router = AdaptiveRouter(config)

# Route a query
tier = router.route("Explain the proof of Fermat's Last Theorem step by step")
# → "tier_high" (complex math query, VRAM available)

# After you get the response, report the actual latency so the SLA rule adapts
t0 = time.monotonic()
# ... call your LLM endpoint ...
router.observe(tier, latency_ms=(time.monotonic() - t0) * 1000)

# Clean up background VRAM monitor thread
router.shutdown()

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      adaptive_router/                       │
│                                                             │
│  Layer 1: ComplexityScorer                                  │
│  ├── Rule-based, 6 dimensions, sub-ms, zero external calls  │
│  ├── math (0.25) · code (0.25) · depth (0.20)              │
│  ├── tokens (0.15) · entities (0.10) · negation (0.05)     │
│  └── Output: score in [0.0, 1.0]                           │
│                                                             │
│  Layer 2: VRAMMonitor (daemon thread)                       │
│  ├── pynvml direct — no subprocess nvidia-smi overhead     │
│  ├── Polling at 10-20ms with atomic shared state           │
│  └── Thread-safe VRAMState (free_gb, used_gb, util_pct)    │
│                                                             │
│  Layer 3: RoutingDecision (chain of rules)                  │
│  ├── complexity_rule: score < threshold → tier_low          │
│  ├── vram_rule: free_gb < headroom → tier_low               │
│  ├── sla_rule: EMA latency > SLA → tier_low                 │
│  └── Default fallthrough: tier_high                         │
└─────────────────────────────────────────────────────────────┘

Configuration reference

Field	Type	Default	Effect
`complexity_threshold`	`float`	`0.65`	Score ≥ this routes to tier_high
`vram_headroom_gb`	`float`	`1.5`	Free VRAM below this forces tier_low
`latency_sla_ms`	`float`	`2000.0`	EMA latency above this forces tier_low
`vram_poll_interval_ms`	`int`	`15`	VRAM polling frequency
`sla_warmup_seed_ms`	`float`	`0.0`	EMA seed at startup — set to p50 of tier_high to avoid cold-start burst

Load from YAML:

from adaptive_router.config import load_config

config = load_config("router_config.yaml")

# router_config.yaml
complexity_threshold: 0.65
vram_headroom_gb: 1.5
latency_sla_ms: 2000.0
vram_poll_interval_ms: 15
sla_warmup_seed_ms: 800.0

Deploy as FastAPI proxy

python -m adaptive_router.middleware \
    --tier-high-url http://llama-cpp:8080/v1 \
    --tier-low-url  http://vllm:8000/v1 \
    --port 9000

Exposes /v1/chat/completions (proxied), /health, and /metrics (Prometheus).

Custom routing rules

from adaptive_router import AdaptiveRouter, RoutingConfig, RouterState

def my_rule(query: str, state: RouterState):
    # Force tier_high for any query mentioning "contract"
    if "contract" in query.lower():
        return "tier_high"
    return None  # abstain, let next rule decide

router = AdaptiveRouter(config, rules=[my_rule])

Known limitations

16GB does not fit 26B in vLLM natively. The router mitigates this by routing complex queries to llama.cpp. The real fix for 200 concurrent users wanting 26B is dual-GPU.
Blackwell SM120 rough edges. FP8 KV and MTP require specific workarounds. See VRAM-REALITY-CHECK.md.
EXL2 is not designed for multi-user production. TabbyAPI maintainers state this explicitly. It's in benchmarks for completeness only.
SGLang does not always win. RadixAttention benefits depend on prefix overlap. See RADIXATTENTION-OPERATIVE-TABLE.md.
MTP in SM120 is not plug-and-play. With BF16 + workarounds it works; with NVFP4 it produces garbage.
Cold-start burst on tier_high. With sla_warmup_seed_ms=0.0, the EMA starts at 0ms — below any SLA. The first ~15-20 complex queries all hit tier_high before the EMA reflects real latency. Set sla_warmup_seed_ms to your expected p50.
This is v0.1. Functional and tested, but not battle-tested at thousands of users. It's the starting point.

Running tests

pip install -e ".[dev]"
pytest tests/ -v  # No GPU required — pynvml is mocked in all fixtures

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

May 15, 2026

This version

0.1.0

May 15, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gemma4_adaptive_router-0.1.0.tar.gz (11.3 kB view details)

Uploaded May 15, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

gemma4_adaptive_router-0.1.0-py3-none-any.whl (3.6 kB view details)

Uploaded May 15, 2026 Python 3

File details

Details for the file gemma4_adaptive_router-0.1.0.tar.gz.

File metadata

Download URL: gemma4_adaptive_router-0.1.0.tar.gz
Upload date: May 15, 2026
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for gemma4_adaptive_router-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7f94556b984b2ec46e97360e648c923945429edcf09f6cf5d6bdec1e640748e1`
MD5	`1e1968d77fd6f17ce16a843fe71c62a7`
BLAKE2b-256	`38e685f909c9c364ce0c2ea7385877fc511641ce4f5639f1c9c8cda10a40de68`

See more details on using hashes here.

File details

Details for the file gemma4_adaptive_router-0.1.0-py3-none-any.whl.

File metadata

Download URL: gemma4_adaptive_router-0.1.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 3.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for gemma4_adaptive_router-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f671fe417e088f3b63d3ec795a2ba8a0e189217044dba3025fbdc01a5a833aea`
MD5	`c0cc9c72b674238f2ddd4d4a3096dcf6`
BLAKE2b-256	`3c4188c3b484fd306e5f487200d303f773744069797b0188cad33b0235134a7e`

See more details on using hashes here.

gemma4-adaptive-router 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

gemma4-adaptive-router

Install

Quickstart

Architecture

Configuration reference

Deploy as FastAPI proxy

Custom routing rules

Known limitations

Running tests

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes