VRAM-aware LLM routing daemon — local-first, OpenAI-compatible
Project description
neuralbroker
VRAM-aware LLM routing daemon · local-first · OpenAI-compatible
NeuralBroker is an OpenAI-compatible routing daemon that sends LLM requests to your local runtimes first, then spills to cloud providers only when VRAM pressure or policy requires it. It keeps your existing SDK flow intact while reducing avoidable cloud spend by turning local hardware into a first-class inference backend.
Quickstart
Option A — pip install (recommended)
Works on macOS, Linux, and Windows.
pip install neuralbrok
neuralbrok setup
neuralbrok start
Option B — Install Normally (from source)
macOS / Linux:
git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
neuralbrok setup
neuralbrok start
Windows (PowerShell):
git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
neuralbrok setup
neuralbrok start
Option C — Docker
git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
cp .env.example .env
docker compose up -d
Proxy: http://localhost:8000/v1 · Dashboard: http://localhost:8000/dashboard
Prerequisites by platform
| Platform | Required | Notes |
|---|---|---|
| macOS · Apple Silicon | Python 3.10+ · Ollama | Metal GPU · automatic unified memory detection |
| macOS · Intel | Python 3.10+ · Ollama | CPU inference · no VRAM pressure |
| Linux · NVIDIA | Python 3.10+ · Ollama · CUDA 11.8+ | Full VRAM telemetry via pynvml |
| Linux · AMD | Python 3.10+ · Ollama · ROCm 5.0+ | ROCm telemetry · llama.cpp recommended |
| Linux · CPU | Python 3.10+ · Ollama | CPU fallback · cloud spillover always active |
| Windows · NVIDIA | Python 3.10+ · Ollama · CUDA 11.8+ | WSL2 recommended for best performance |
| Windows · CPU | Python 3.10+ · Ollama | CPU fallback |
| Docker · any | Docker Desktop or Docker Engine | No Python install needed |
Ollama: ollama.com/download
One-line SDK change:
client = OpenAI(base_url="http://localhost:8000/v1", api_key="nb_live_...")
What setup does
When you run neuralbrok setup, the device detection module automatically profiles your hardware:
- Detects your GPU vendor, model, and available VRAM (or unified memory).
- Configures the optimal local runtime (Ollama for CUDA/Metal, llama.cpp for ROCm/CPU).
- Calculates a safe VRAM threshold to avoid out-of-memory errors.
- Estimates the local electricity cost (TDP) for accurate cloud-cost comparisons.
- Recommends the best quantized models that fit entirely within your memory.
Supported hardware
- NVIDIA: RTX 30/40 series, or any CUDA 11.8+ compatible GPU.
- Apple Silicon: M1 through M4 (Base, Pro, Max, Ultra variants) via Metal unified memory.
- AMD: Radeon GPUs supporting ROCm 5.0+.
- CPU-only: Fallback mode for systems without a dedicated AI accelerator.
How It Works
- Point your OpenAI SDK
base_urlto NeuralBroker. - NeuralBroker polls local GPU state (VRAM/utilization) on a short interval.
- Policy engine scores local and cloud providers per request.
- Response streams back in OpenAI format with routing headers/metrics.
Routing Modes
cost-mode
Route local when VRAM is under threshold, otherwise spill to the cheapest cloud backend.
routing:
default_mode: cost
speed-mode
Always route local for lowest path latency and strict local-only behavior.
routing:
default_mode: speed
fallback-mode
Prefer local; fall back to cloud on OOM/error; resume local when healthy.
routing:
default_mode: fallback
Providers
20 Pattern-A providers · 8 Pattern-B providers · 4 local runtimes · 32 total
Model lists reflect April 2026. Provider catalogs change frequently — check each provider's docs for the latest available models.
Pattern A — OpenAI-compatible
| Provider | Notes |
|---|---|
| OpenAI | gpt-5.4, gpt-5.4-mini, gpt-5.4-nano — flagship for complex reasoning and coding |
| Groq | Llama 4 Maverick, Llama 4 Scout, Qwen3-32B, Llama 3.3 70B — fastest inference via LPU, best first spillover target |
| Together AI | DeepSeek V3.2, Llama 4 Maverick, Qwen3-235B, Gemini 3.1 Flash Lite — widest open model catalog |
| Cerebras | Llama 3.3 70B, Qwen3-32B, Qwen3-235B, GPT-OSS 120B — wafer-scale, 20x faster throughput than NVIDIA |
| DeepInfra | DeepSeek V3.2, Qwen3-235B, Llama 4, Mistral Small 4 — cheapest per-token on most open models |
| Fireworks AI | Llama 4 Maverick, DeepSeek V3, Qwen3 — fast inference with strong function calling on open models |
| Lepton AI | Llama 3.3 70B, Qwen3 variants — serverless GPU cloud |
| Novita AI | Qwen3 and DeepSeek variants at lowest market pricing |
| Hyperbolic | Llama 4, Qwen3-235B — decentralized GPU marketplace, competitive on 70B+ models |
| Mistral AI | Mistral Small 4, Mistral Large, Codestral — only source for first-party Mistral models |
| Kimi (Moonshot) | Kimi K2.6 — highest-ranked open weights model on Intelligence Index (score 54), 1M token context |
| DeepSeek | DeepSeek V3.2, DeepSeek V3.1, DeepSeek R1 — best price-to-performance for coding and reasoning |
| Qwen (DashScope) | Qwen3-235B-A22B, Qwen3-32B, Qwen3-Coder, QwQ-32B, Qwen3.5 — Alibaba's hybrid reasoning/instruct family |
| Yi (01.AI) | Yi-Lightning, Yi-Large — strong multilingual, competitive pricing |
| Baichuan | Baichuan4 — strongest Chinese language understanding |
| Zhipu (GLM-4) | GLM-5.1 (Reasoning), GLM-5 — top open-weights reasoning models, score 51 on Intelligence Index |
| Perplexity | Sonar Pro, Sonar — live web search built in, unique for online and RAG workloads |
| AI21 Labs | Jamba 1.5 Large, Jamba 1.5 Mini — SSM-Transformer hybrid, long context at low cost |
| OctoAI | Llama 4, Qwen3 variants — auto-scaling serverless, good for burst spillover |
| OpenRouter | DeepSeek V3, Llama 4 Maverick, Qwen3-235B and 100+ more — last-resort fallback only |
Pattern B — Custom translation layer
| Provider | Notes |
|---|---|
| Anthropic | Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5 — best for agentic coding, long context, and complex reasoning |
| Google Gemini | Gemini 3.1 Pro, Gemini 3.1 Flash, Gemini 3 Flash Preview — top reasoning benchmarks, 1M token context |
| Cohere | Command-R+ — enterprise RAG with built-in grounding and citation |
| Replicate | Any open model including fine-tunes — polling-based predictions API, widest model selection |
| Cloudflare AI | Workers AI — edge inference at Cloudflare's global network, lowest geographic latency |
| AWS Bedrock | Claude Opus 4.7, Haiku 4.5, Llama 4, Amazon Nova 2 Pro/Lite/Micro — managed AWS, data residency |
| Azure OpenAI | gpt-5.4, gpt-5.4-mini — deployment-based, api-key auth, Microsoft enterprise agreements |
| Google Vertex | Gemini 3.1 Pro via GCP — VPC and private endpoint support for Google Cloud teams |
Local runtimes
| Runtime | Platform | Notes |
|---|---|---|
| Ollama | NVIDIA · Apple Silicon · AMD | Recommended default — native Metal and CUDA, Llama 4, Qwen3, DeepSeek models in model library |
| llama.cpp | NVIDIA · AMD · CPU | Best for AMD ROCm, CPU-only, and maximum quantization control — supports Qwen3, Llama 4, DeepSeek |
| LM Studio | NVIDIA · Apple Silicon | GUI-first model browser — exposes OpenAI-compatible server, good for Apple Silicon |
| vLLM | NVIDIA | Best throughput for concurrent requests — PagedAttention, continuous batching, production serving |
Docker
docker compose up -d
This starts NeuralBroker plus supporting observability services (Prometheus/Grafana) defined in docker-compose.yml.
Development
Run locally:
uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
Run tests:
pytest -q
Contributing
Contributions are welcome. If you have bug reports, routing ideas, provider integrations, or docs improvements, open an issue or PR and we will review quickly: GitHub Issues.
MIT License — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file neuralbrok-0.4.0.tar.gz.
File metadata
- Download URL: neuralbrok-0.4.0.tar.gz
- Upload date:
- Size: 64.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9b139bcc907d59c95ee4bf3ea0a74758afdc59ab0f55e766a3e97abc38169f79
|
|
| MD5 |
81c4c9e62dd2af93c4b64d535132887d
|
|
| BLAKE2b-256 |
570811145530f6773a44fc9d84b5502defc2a55e4b1db02440df5d5d65b0a2b0
|
File details
Details for the file neuralbrok-0.4.0-py3-none-any.whl.
File metadata
- Download URL: neuralbrok-0.4.0-py3-none-any.whl
- Upload date:
- Size: 77.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
86d4f4b54bab75b43dbad59f71503c0db9968f9799b6edf7b131b0cff9f0f4ab
|
|
| MD5 |
09717be0b55fd45335eb2bdb71240ab6
|
|
| BLAKE2b-256 |
1302ae192b06d918adc60cb103c4a561cb06d50bdc3664ca8677f71e3f663b5d
|