VRAM-aware LLM routing daemon — local-first, OpenAI-compatible

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

khan-ash

These details have not been verified by PyPI

Project links

Homepage

Project description

neuralbroker

VRAM-aware LLM routing daemon · local-first · OpenAI-compatible

Python License: MIT FastAPI

NeuralBroker is an OpenAI-compatible routing daemon that sends LLM requests to your local runtimes first, then spills to cloud providers only when VRAM pressure or policy requires it. It keeps your existing SDK flow intact while reducing avoidable cloud spend by turning local hardware into a first-class inference backend.

What makes it different

VRAM-aware routing — polls GPU state every 500ms; routes locally when memory is free, spills to cloud when pressure builds
SmartModelSelector — scores every runnable local model on params, workload fit, tok/s, context length, VRAM headroom, and MoE architecture; picks the best one per request category, not just the largest
MoE detection — identifies mixture-of-experts models (e.g. qwen3:30b-a3b) and scores them separately from dense models; prioritizes them for fast-response workloads where low activation count wins
Four routing modes — cost, speed, fallback, and smart (prompt-classified, model-matched)
Manual model selection — technical users can pick exact local models by number during setup; no guessing
32 providers, zero SDK changes — point base_url at NeuralBroker, keep the same openai client
Interactive setup TUI — hardware detection, VRAM visualization, model compatibility bars, algorithm self-test, and routing mode selection all in one guided flow

Quickstart

Option A — pip install (recommended)

Works on macOS, Linux, and Windows.

pip install neuralbrok
neuralbrok setup
neuralbrok start

Option B — from source

macOS / Linux:

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
neuralbrok setup
neuralbrok start

Windows (PowerShell):

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -e .
neuralbrok setup
neuralbrok start

Option C — Docker

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
cp .env.example .env
docker compose up -d

Proxy: http://localhost:8000/v1 · Dashboard: http://localhost:8000/dashboard

Prerequisites by platform

Platform	Required	Notes
macOS · Apple Silicon	Python 3.10+ · Ollama	Metal GPU · automatic unified memory detection
macOS · Intel	Python 3.10+ · Ollama	CPU inference · no VRAM pressure
Linux · NVIDIA	Python 3.10+ · Ollama · CUDA 11.8+	Full VRAM telemetry via pynvml
Linux · AMD	Python 3.10+ · Ollama · ROCm 5.0+	ROCm telemetry · llama.cpp recommended
Linux · CPU	Python 3.10+ · Ollama	CPU fallback · cloud spillover always active
Windows · NVIDIA	Python 3.10+ · Ollama · CUDA 11.8+	WSL2 recommended for best performance
Windows · CPU	Python 3.10+ · Ollama	CPU fallback
Docker · any	Docker Desktop or Docker Engine	No Python install needed

Ollama: ollama.com/download

One-line SDK change:

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nb_live_...")

What setup does

neuralbrok setup runs a fully interactive TUI that:

Detects GPU vendor, model, and available VRAM (or Apple unified memory)
Shows a color-coded VRAM bar and backend (CUDA / Metal / ROCm / CPU)
Profiles every installed Ollama model — VRAM fit, estimated tok/s, compatibility score
Lets you pick workload type (code, chat, reasoning, RAG, mixed, etc.) or use manual mode to select exact models by number from a ranked list
Runs an algorithm self-test showing which model routes to 4 sample prompts and why
Prompts for cloud provider API keys (optional — local-only works without any)
Writes ~/.neuralbrok/config.yaml

Supported hardware

NVIDIA: RTX 30/40 series, or any CUDA 11.8+ compatible GPU
Apple Silicon: M1 through M4 (Base, Pro, Max, Ultra variants) via Metal unified memory
AMD: Radeon GPUs supporting ROCm 5.0+
CPU-only: Fallback mode for systems without a dedicated AI accelerator

How it works

Point your OpenAI SDK base_url to NeuralBroker.
NeuralBroker polls local GPU state (VRAM / utilization) every 500ms.
Policy engine scores local and cloud providers per request using the active routing mode.
Response streams back in OpenAI format with routing headers and metrics.

Routing modes

cost

Route local when VRAM is under threshold, otherwise spill to the cheapest cloud backend.

routing:
  default_mode: cost

speed

Always route local for lowest path latency. Cloud only on local failure.

routing:
  default_mode: speed

fallback

Prefer local. Fall back to cloud on OOM or repeated error within a 30s window. Resume local when healthy.

routing:
  default_mode: fallback

smart

Classify each prompt into a workload category (code, reasoning, RAG, fast response, long context, tools, chat). Run SmartModelSelector to score all runnable local models against that category using params, workload fit, tok/s, context length, VRAM headroom, and MoE architecture weight. Pick the best match. Fall back to cloud only on failure.

routing:
  default_mode: smart

SmartModelSelector

The selector scores every model that fits in available VRAM and ranks them per request:

Signal	Weight
Parameter count	baseline score
Workload capability match	+15 per matching tag
Workload recommended_for match	+20 per matching tag
tok/s on your hardware (>60)	+10
tok/s on your hardware (>30)	+5
tok/s on your hardware (<10)	−10
Long-context request + ctx ≥128k	+25
VRAM headroom (free − model)	×2
MoE model + fast_response workload	+15

Scores are normalized to 0–100% and the top 3 are returned. The highest scorer routes.

API endpoints

Endpoint	Description
`POST /v1/chat/completions`	OpenAI-compatible chat completions
`POST /v1/completions`	OpenAI-compatible text completions
`GET /health`	Health check — returns status and uptime
`GET /nb/stats`	Routing stats — requests routed, fallbacks, smart classifications, avg classify ms
`GET /metrics`	Prometheus metrics scrape endpoint
`GET /dashboard`	Local web dashboard — live routing view

Providers

32 total: 20 Pattern-A (OpenAI-compatible) · 8 Pattern-B (custom adapter) · 4 local runtimes

Model lists reflect current catalog at time of release. Provider catalogs change frequently — check each provider's docs for latest available models.

Pattern A — OpenAI-compatible

Provider	Notes
OpenAI	GPT-4o, GPT-4o mini — flagship for complex reasoning and coding
Groq	Llama 4 Maverick, Llama 4 Scout, Qwen3-32B, Llama 3.3 70B — fastest inference via LPU, best first spillover target
Together AI	DeepSeek V3, Llama 4 Maverick, Qwen3-235B — widest open model catalog
Cerebras	Llama 3.3 70B, Qwen3-32B — wafer-scale, 20x faster throughput than NVIDIA
DeepInfra	DeepSeek V3, Qwen3-235B, Llama 4, Mistral Small — cheapest per-token on most open models
Fireworks AI	Llama 4 Maverick, DeepSeek V3, Qwen3 — fast inference with strong function calling on open models
Lepton AI	Llama 3.3 70B, Qwen3 variants — serverless GPU cloud
Novita AI	Qwen3 and DeepSeek variants at lowest market pricing
Hyperbolic	Llama 4, Qwen3-235B — decentralized GPU marketplace, competitive on 70B+ models
Mistral AI	Mistral Small, Mistral Large, Codestral — only source for first-party Mistral models
Kimi (Moonshot)	Kimi K2 — strong long-context (1M token) and multilingual
DeepSeek	DeepSeek V3, DeepSeek R1 — best price-to-performance for coding and reasoning
Qwen (DashScope)	Qwen3-235B-A22B, Qwen3-32B, Qwen3-Coder, QwQ-32B — Alibaba's hybrid reasoning/instruct family
Yi (01.AI)	Yi-Lightning, Yi-Large — strong multilingual, competitive pricing
Baichuan	Baichuan4 — strongest Chinese language understanding
Zhipu (GLM-4)	GLM-4 — strong open-weights reasoning
Perplexity	Sonar Pro, Sonar — live web search built in, unique for online and RAG workloads
AI21 Labs	Jamba 1.5 Large, Jamba 1.5 Mini — SSM-Transformer hybrid, long context at low cost
OctoAI	Llama 4, Qwen3 variants — auto-scaling serverless, good for burst spillover
OpenRouter	100+ models — last-resort fallback with widest model selection

Pattern B — Custom translation layer

Provider	Notes
Anthropic	Claude Sonnet, Claude Haiku — best for agentic coding, long context, and complex reasoning
Google Gemini	Gemini 1.5 Pro, Gemini 1.5 Flash — top reasoning benchmarks, 1M token context
Cohere	Command-R+ — enterprise RAG with built-in grounding and citation
Replicate	Any open model including fine-tunes — polling-based predictions API, widest model selection
Cloudflare AI	Workers AI — edge inference at Cloudflare's global network, lowest geographic latency
AWS Bedrock	Claude, Llama 4, Amazon Nova — managed AWS, data residency
Azure OpenAI	GPT-4o, GPT-4o mini — deployment-based, Microsoft enterprise agreements
Google Vertex	Gemini Pro via GCP — VPC and private endpoint support for Google Cloud teams

Local runtimes

Runtime	Platform	Notes
Ollama	NVIDIA · Apple Silicon · AMD	Recommended default — native Metal and CUDA, Llama 4, Qwen3, DeepSeek in model library
llama.cpp	NVIDIA · AMD · CPU	Best for AMD ROCm, CPU-only, and maximum quantization control
LM Studio	NVIDIA · Apple Silicon	GUI-first model browser — exposes OpenAI-compatible server
vLLM	NVIDIA	Best throughput for concurrent requests — PagedAttention, continuous batching, production serving

Integrations

NeuralBroker exposes an OpenAI-compatible endpoint at localhost:8000. Most AI coding agents and IDEs can point to it as their backend. The neuralbrok integrations CLI command automates the configuration for 23 major agents.

Agent	Type	Config	Setup
Claude Code	✅ native OpenAI-compat	`.claude/settings.json`	`neuralbrok integrations setup claude-code`
Cursor	✅ native	`.cursor/mcp.json`	`neuralbrok integrations setup cursor`
Cline	✅ native	`.cline/settings.json`	`neuralbrok integrations setup cline`
GitHub Copilot	🔄 via proxy	`.vscode/settings.json`	`neuralbrok integrations setup github-copilot`
Gemini CLI	✅ native	`.gemini/settings.json`	`neuralbrok integrations setup gemini-cli`
OpenCode	✅ native	`opencode.json`	`neuralbrok integrations setup opencode`
Warp	✅ env var	`~/.warp/preferences.yaml`	`neuralbrok integrations setup warp`
Codex	✅ env var	`.env + ~/.codex/config.json`	`neuralbrok integrations setup codex`
Amp	✅ native	`~/.amp/config.json`	`neuralbrok integrations setup amp`
Kimi Code	✅ env var	`.kimi/config.json + .env`	`neuralbrok integrations setup kimi-code`
Firebender	✅ native	`.firebender/config.json`	`neuralbrok integrations setup firebender`
Deep Agents	✅ native	`.deepagent/config.json`	`neuralbrok integrations setup deep-agents`
Augment	📝 skill file	`.augment/skills/neuralbroker.md`	`neuralbrok integrations setup augment`
IBM Bob	📝 skill file	`.bob/skills/neuralbroker.md`	`neuralbrok integrations setup ibm-bob`
OpenClaw	📝 skill file	`skills/neuralbroker.md`	`neuralbrok integrations setup openclaw`
CodeBuddy	📝 skill file	`.codebuddy/skills/neuralbroker.md`	`neuralbrok integrations setup codebuddy`
Cortex Code	📝 skill file	`.cortex/skills/neuralbroker.md`	`neuralbrok integrations setup cortex-code`
Kilo Code	📝 skill file	`.kilocode/skills/neuralbroker.md`	`neuralbrok integrations setup kilo-code`
Kiro CLI	📝 skill file	`.kiro/skills/neuralbroker.md`	`neuralbrok integrations setup kiro-cli`
Kode	📝 skill file	`.kode/skills/neuralbroker.md`	`neuralbrok integrations setup kode`
Qwen Code	📝 skill file	`.qwen/skills/neuralbroker.md`	`neuralbrok integrations setup qwen-code`
Trae	📝 skill file	`.trae/skills/neuralbroker.md`	`neuralbrok integrations setup trae`
Windsurf	📝 skill file	`.windsurf/skills/neuralbroker.md`	`neuralbrok integrations setup windsurf`

Docker

docker compose up -d

Starts NeuralBroker plus Prometheus and Grafana for observability. Configure credentials in .env (see .env.example).

Development

uvicorn src.neuralbrok.main:app --reload --host 0.0.0.0 --port 8000

Run tests:

pytest -q

Roadmap

Phase 1 (BETA)

Claude Code terminal connection — neuralbrok code runs NeuralBroker-aware Claude Code shell with routing context
Dashboard v2 — live routing waterfall, model switching, per-provider cost graph
neuralbrok doctor — diagnose config issues, test provider connectivity, benchmark local models

Phase 2

Hermes agent integration — deploy autonomous agents that use NeuralBroker for model selection and routing
Openclaw integration — connect Openclaw orchestrator to NeuralBroker for decentralized agent coordination
Prompt caching integration — detect repeated system prompts and route to providers with cache discount
Per-model cost tracking — log actual token spend per model per day with budget alerts

Phase 3

Dynamic provider weighting — auto-demote slow or error-prone providers without manual config changes
Fine-grained routing rules — route by model name, tag, or regex in config.yaml
GGUF download helper — pull and register quantized models from HuggingFace directly from CLI
Multi-GPU support — VRAM aggregation and per-GPU model pinning for multi-card setups
Multi-agent framework support — extend to Anthropic Managed Agents, LangGraph, CrewAI

Contributing

Contributions are welcome. Bug reports, routing ideas, provider integrations, docs improvements — open an issue or PR: GitHub Issues.

MIT License — see LICENSE.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

khan-ash

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.6.7

Apr 25, 2026

0.6.6

Apr 25, 2026

This version

0.6.5

Apr 25, 2026

0.6.4

Apr 25, 2026

0.6.3

Apr 25, 2026

0.6.2

Apr 25, 2026

0.6.1

Apr 25, 2026

0.6.0

Apr 25, 2026

0.5.3

Apr 24, 2026

0.5.2

Apr 24, 2026

0.5.0

Apr 24, 2026

0.4.4

Apr 24, 2026

0.4.3

Apr 24, 2026

0.4.2

Apr 24, 2026

0.4.0

Apr 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

neuralbrok-0.6.5.tar.gz (118.1 kB view details)

Uploaded Apr 25, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

neuralbrok-0.6.5-py3-none-any.whl (116.3 kB view details)

Uploaded Apr 25, 2026 Python 3

File details

Details for the file neuralbrok-0.6.5.tar.gz.

File metadata

Download URL: neuralbrok-0.6.5.tar.gz
Upload date: Apr 25, 2026
Size: 118.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for neuralbrok-0.6.5.tar.gz
Algorithm	Hash digest
SHA256	`1c104e6e0bde548584c3364ab8ca168a9150ec8fd1de15129d8be88df204179f`
MD5	`e8ea42bdb7593f9cfb63ff9aa1a4e075`
BLAKE2b-256	`962bb97bcbb7464a2976f2a7364c41f8e03d17474006eed0e5d54738dc736c56`

See more details on using hashes here.

Provenance

The following attestation bundles were made for neuralbrok-0.6.5.tar.gz:

Publisher: pypi-publish.yml on khan-sha/neuralbroker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: neuralbrok-0.6.5.tar.gz
- Subject digest: 1c104e6e0bde548584c3364ab8ca168a9150ec8fd1de15129d8be88df204179f
- Sigstore transparency entry: 1376470538
- Sigstore integration time: Apr 25, 2026
Source repository:
- Permalink: khan-sha/neuralbroker@0ec6c7ab876877e37dffdef97afeb6af95d9bd5d
- Branch / Tag: refs/tags/v0.6.5
- Owner: https://github.com/khan-sha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@0ec6c7ab876877e37dffdef97afeb6af95d9bd5d
- Trigger Event: push

File details

Details for the file neuralbrok-0.6.5-py3-none-any.whl.

File metadata

Download URL: neuralbrok-0.6.5-py3-none-any.whl
Upload date: Apr 25, 2026
Size: 116.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for neuralbrok-0.6.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fd202feffc3aa49bc05a472dcdc410855233dd7452c44beb95f0960a5d2fe9c4`
MD5	`0b9f77e3a520f2b9ced419b5b17a3828`
BLAKE2b-256	`586d8e815e812dac0650c8d7155c28bee13c2789e7680d05da211515d8f7e440`

See more details on using hashes here.

Provenance

The following attestation bundles were made for neuralbrok-0.6.5-py3-none-any.whl:

Publisher: pypi-publish.yml on khan-sha/neuralbroker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: neuralbrok-0.6.5-py3-none-any.whl
- Subject digest: fd202feffc3aa49bc05a472dcdc410855233dd7452c44beb95f0960a5d2fe9c4
- Sigstore transparency entry: 1376470542
- Sigstore integration time: Apr 25, 2026
Source repository:
- Permalink: khan-sha/neuralbroker@0ec6c7ab876877e37dffdef97afeb6af95d9bd5d
- Branch / Tag: refs/tags/v0.6.5
- Owner: https://github.com/khan-sha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@0ec6c7ab876877e37dffdef97afeb6af95d9bd5d
- Trigger Event: push

neuralbrok 0.6.5

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

neuralbroker

VRAM-aware LLM routing daemon · local-first · OpenAI-compatible

What makes it different

Quickstart

Option A — pip install (recommended)

Option B — from source

Option C — Docker

Prerequisites by platform

What setup does

Supported hardware

How it works

Routing modes

cost

speed

fallback

smart

SmartModelSelector

API endpoints

Providers

Pattern A — OpenAI-compatible

Pattern B — Custom translation layer

Local runtimes

Integrations

Docker

Development

Roadmap

Phase 1 (BETA)

Phase 2

Phase 3

Contributing

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance