Skip to main content

Complexity + VRAM-aware routing for local dual-tier LLM deployments

Project description

gemma4-adaptive-router

Complexity + VRAM-aware routing for local dual-tier LLM deployments.
The only public implementation of sub-millisecond complexity scoring coupled to real-time VRAM scheduling for consumer GPU on-prem inference.

PyPI Python 3.10+ Apache-2.0

Install

pip install gemma4-adaptive-router

Quickstart

from adaptive_router import AdaptiveRouter, RoutingConfig
import time

config = RoutingConfig(
    complexity_threshold=0.65,   # score >= this → tier_high
    vram_headroom_gb=1.5,        # free VRAM below this → force tier_low
    latency_sla_ms=2000.0,       # EMA latency above this → force tier_low
    sla_warmup_seed_ms=800.0,    # seed EMA to avoid cold-start burst on tier_high
)

router = AdaptiveRouter(config)

# Route a query
tier = router.route("Explain the proof of Fermat's Last Theorem step by step")
# → "tier_high" (complex math query, VRAM available)

# After you get the response, report the actual latency so the SLA rule adapts
t0 = time.monotonic()
# ... call your LLM endpoint ...
router.observe(tier, latency_ms=(time.monotonic() - t0) * 1000)

# Clean up background VRAM monitor thread
router.shutdown()

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      adaptive_router/                       │
│                                                             │
│  Layer 1: ComplexityScorer                                  │
│  ├── Rule-based, 6 dimensions, sub-ms, zero external calls  │
│  ├── math (0.25) · code (0.25) · depth (0.20)              │
│  ├── tokens (0.15) · entities (0.10) · negation (0.05)     │
│  └── Output: score in [0.0, 1.0]                           │
│                                                             │
│  Layer 2: VRAMMonitor (daemon thread)                       │
│  ├── pynvml direct — no subprocess nvidia-smi overhead     │
│  ├── Polling at 10-20ms with atomic shared state           │
│  └── Thread-safe VRAMState (free_gb, used_gb, util_pct)    │
│                                                             │
│  Layer 3: RoutingDecision (chain of rules)                  │
│  ├── complexity_rule: score < threshold → tier_low          │
│  ├── vram_rule: free_gb < headroom → tier_low               │
│  ├── sla_rule: EMA latency > SLA → tier_low                 │
│  └── Default fallthrough: tier_high                         │
└─────────────────────────────────────────────────────────────┘

Configuration reference

Field Type Default Effect
complexity_threshold float 0.65 Score ≥ this routes to tier_high
vram_headroom_gb float 1.5 Free VRAM below this forces tier_low
latency_sla_ms float 2000.0 EMA latency above this forces tier_low
vram_poll_interval_ms int 15 VRAM polling frequency
sla_warmup_seed_ms float 0.0 EMA seed at startup — set to p50 of tier_high to avoid cold-start burst

Load from YAML:

from adaptive_router.config import load_config

config = load_config("router_config.yaml")
# router_config.yaml
complexity_threshold: 0.65
vram_headroom_gb: 1.5
latency_sla_ms: 2000.0
vram_poll_interval_ms: 15
sla_warmup_seed_ms: 800.0

Deploy as FastAPI proxy

python -m adaptive_router.middleware \
    --tier-high-url http://llama-cpp:8080/v1 \
    --tier-low-url  http://vllm:8000/v1 \
    --port 9000

Exposes /v1/chat/completions (proxied), /health, and /metrics (Prometheus).

Custom routing rules

from adaptive_router import AdaptiveRouter, RoutingConfig, RouterState

def my_rule(query: str, state: RouterState):
    # Force tier_high for any query mentioning "contract"
    if "contract" in query.lower():
        return "tier_high"
    return None  # abstain, let next rule decide

router = AdaptiveRouter(config, rules=[my_rule])

Known limitations

  1. 16GB does not fit 26B in vLLM natively. The router mitigates this by routing complex queries to llama.cpp. The real fix for 200 concurrent users wanting 26B is dual-GPU.
  2. Blackwell SM120 rough edges. FP8 KV and MTP require specific workarounds. See VRAM-REALITY-CHECK.md.
  3. EXL2 is not designed for multi-user production. TabbyAPI maintainers state this explicitly. It's in benchmarks for completeness only.
  4. SGLang does not always win. RadixAttention benefits depend on prefix overlap. See RADIXATTENTION-OPERATIVE-TABLE.md.
  5. MTP in SM120 is not plug-and-play. With BF16 + workarounds it works; with NVFP4 it produces garbage.
  6. Cold-start burst on tier_high. With sla_warmup_seed_ms=0.0, the EMA starts at 0ms — below any SLA. The first ~15-20 complex queries all hit tier_high before the EMA reflects real latency. Set sla_warmup_seed_ms to your expected p50.
  7. This is v0.1. Functional and tested, but not battle-tested at thousands of users. It's the starting point.

Running tests

pip install -e ".[dev]"
pytest tests/ -v  # No GPU required — pynvml is mocked in all fixtures

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gemma4_adaptive_router-0.1.1.tar.gz (11.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gemma4_adaptive_router-0.1.1-py3-none-any.whl (3.6 kB view details)

Uploaded Python 3

File details

Details for the file gemma4_adaptive_router-0.1.1.tar.gz.

File metadata

  • Download URL: gemma4_adaptive_router-0.1.1.tar.gz
  • Upload date:
  • Size: 11.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gemma4_adaptive_router-0.1.1.tar.gz
Algorithm Hash digest
SHA256 6ce9edd76eea1d54c0377b21a13f2720a3af7833201df8187434d2876ec94e87
MD5 e951424e45f10d3421d6a510f40d1421
BLAKE2b-256 3c1bce862c990dc99f1fc838f61239b624573278272ee24f9c93ec8efa44a476

See more details on using hashes here.

Provenance

The following attestation bundles were made for gemma4_adaptive_router-0.1.1.tar.gz:

Publisher: publish.yml on angelnicolasc/Stratum

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gemma4_adaptive_router-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for gemma4_adaptive_router-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d49a14287af6dbd00e275a7adf70e771bd2122bd2b44390bdec48a08cba24d80
MD5 24edd5e20f3b6cc224ea6cacbeedea1e
BLAKE2b-256 55358d2347ccbd478f122de0b2bc5d7aaca9bdbdadc9be8538de73dcce36dfc3

See more details on using hashes here.

Provenance

The following attestation bundles were made for gemma4_adaptive_router-0.1.1-py3-none-any.whl:

Publisher: publish.yml on angelnicolasc/Stratum

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page