VRAM-aware LLM routing daemon — local-first, OpenAI-compatible

These details have not been verified by PyPI

Project links

Project description

neuralbroker

VRAM-aware LLM routing daemon · local-first · OpenAI-compatible

Python License: MIT FastAPI

NeuralBroker is an OpenAI-compatible routing daemon that sends LLM requests to your local runtimes first, then spills to cloud providers only when VRAM pressure or policy requires it. It keeps your existing SDK flow intact while reducing avoidable cloud spend by turning local hardware into a first-class inference backend.

Quickstart

Option A — pip install (recommended)

Works on macOS, Linux, and Windows.

pip install neuralbrok
neuralbrok setup
neuralbrok start

Option B — Install Normally (from source)

macOS / Linux:

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
neuralbrok setup
neuralbrok start

Windows (PowerShell):

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
neuralbrok setup
neuralbrok start

Option C — Docker

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
cp .env.example .env
docker compose up -d

Proxy: http://localhost:8000/v1 · Dashboard: http://localhost:8000/dashboard

Prerequisites by platform

Platform	Required	Notes
macOS · Apple Silicon	Python 3.10+ · Ollama	Metal GPU · automatic unified memory detection
macOS · Intel	Python 3.10+ · Ollama	CPU inference · no VRAM pressure
Linux · NVIDIA	Python 3.10+ · Ollama · CUDA 11.8+	Full VRAM telemetry via pynvml
Linux · AMD	Python 3.10+ · Ollama · ROCm 5.0+	ROCm telemetry · llama.cpp recommended
Linux · CPU	Python 3.10+ · Ollama	CPU fallback · cloud spillover always active
Windows · NVIDIA	Python 3.10+ · Ollama · CUDA 11.8+	WSL2 recommended for best performance
Windows · CPU	Python 3.10+ · Ollama	CPU fallback
Docker · any	Docker Desktop or Docker Engine	No Python install needed

Ollama: ollama.com/download

One-line SDK change:

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nb_live_...")

What setup does

When you run neuralbrok setup, the device detection module automatically profiles your hardware:

Detects your GPU vendor, model, and available VRAM (or unified memory).
Configures the optimal local runtime (Ollama for CUDA/Metal, llama.cpp for ROCm/CPU).
Calculates a safe VRAM threshold to avoid out-of-memory errors.
Estimates the local electricity cost (TDP) for accurate cloud-cost comparisons.
Recommends the best quantized models that fit entirely within your memory.

Supported hardware

NVIDIA: RTX 30/40 series, or any CUDA 11.8+ compatible GPU.
Apple Silicon: M1 through M4 (Base, Pro, Max, Ultra variants) via Metal unified memory.
AMD: Radeon GPUs supporting ROCm 5.0+.
CPU-only: Fallback mode for systems without a dedicated AI accelerator.

How It Works

Point your OpenAI SDK base_url to NeuralBroker.
NeuralBroker polls local GPU state (VRAM/utilization) on a short interval.
Policy engine scores local and cloud providers per request.
Response streams back in OpenAI format with routing headers/metrics.

Routing Modes

cost-mode

Route local when VRAM is under threshold, otherwise spill to the cheapest cloud backend.

routing:
  default_mode: cost

speed-mode

Always route local for lowest path latency and strict local-only behavior.

routing:
  default_mode: speed

fallback-mode

Prefer local; fall back to cloud on OOM/error; resume local when healthy.

routing:
  default_mode: fallback

Providers

20 Pattern-A providers · 8 Pattern-B providers · 4 local runtimes · 32 total

Model lists reflect April 2026. Provider catalogs change frequently — check each provider's docs for the latest available models.

Pattern A — OpenAI-compatible

Provider	Notes
OpenAI	gpt-5.4, gpt-5.4-mini, gpt-5.4-nano — flagship for complex reasoning and coding
Groq	Llama 4 Maverick, Llama 4 Scout, Qwen3-32B, Llama 3.3 70B — fastest inference via LPU, best first spillover target
Together AI	DeepSeek V3.2, Llama 4 Maverick, Qwen3-235B, Gemini 3.1 Flash Lite — widest open model catalog
Cerebras	Llama 3.3 70B, Qwen3-32B, Qwen3-235B, GPT-OSS 120B — wafer-scale, 20x faster throughput than NVIDIA
DeepInfra	DeepSeek V3.2, Qwen3-235B, Llama 4, Mistral Small 4 — cheapest per-token on most open models
Fireworks AI	Llama 4 Maverick, DeepSeek V3, Qwen3 — fast inference with strong function calling on open models
Lepton AI	Llama 3.3 70B, Qwen3 variants — serverless GPU cloud
Novita AI	Qwen3 and DeepSeek variants at lowest market pricing
Hyperbolic	Llama 4, Qwen3-235B — decentralized GPU marketplace, competitive on 70B+ models
Mistral AI	Mistral Small 4, Mistral Large, Codestral — only source for first-party Mistral models
Kimi (Moonshot)	Kimi K2.6 — highest-ranked open weights model on Intelligence Index (score 54), 1M token context
DeepSeek	DeepSeek V3.2, DeepSeek V3.1, DeepSeek R1 — best price-to-performance for coding and reasoning
Qwen (DashScope)	Qwen3-235B-A22B, Qwen3-32B, Qwen3-Coder, QwQ-32B, Qwen3.5 — Alibaba's hybrid reasoning/instruct family
Yi (01.AI)	Yi-Lightning, Yi-Large — strong multilingual, competitive pricing
Baichuan	Baichuan4 — strongest Chinese language understanding
Zhipu (GLM-4)	GLM-5.1 (Reasoning), GLM-5 — top open-weights reasoning models, score 51 on Intelligence Index
Perplexity	Sonar Pro, Sonar — live web search built in, unique for online and RAG workloads
AI21 Labs	Jamba 1.5 Large, Jamba 1.5 Mini — SSM-Transformer hybrid, long context at low cost
OctoAI	Llama 4, Qwen3 variants — auto-scaling serverless, good for burst spillover
OpenRouter	DeepSeek V3, Llama 4 Maverick, Qwen3-235B and 100+ more — last-resort fallback only

Pattern B — Custom translation layer

Provider	Notes
Anthropic	Claude Opus 4.7, Claude Sonnet 4.6, Claude Haiku 4.5 — best for agentic coding, long context, and complex reasoning
Google Gemini	Gemini 3.1 Pro, Gemini 3.1 Flash, Gemini 3 Flash Preview — top reasoning benchmarks, 1M token context
Cohere	Command-R+ — enterprise RAG with built-in grounding and citation
Replicate	Any open model including fine-tunes — polling-based predictions API, widest model selection
Cloudflare AI	Workers AI — edge inference at Cloudflare's global network, lowest geographic latency
AWS Bedrock	Claude Opus 4.7, Haiku 4.5, Llama 4, Amazon Nova 2 Pro/Lite/Micro — managed AWS, data residency
Azure OpenAI	gpt-5.4, gpt-5.4-mini — deployment-based, api-key auth, Microsoft enterprise agreements
Google Vertex	Gemini 3.1 Pro via GCP — VPC and private endpoint support for Google Cloud teams

Local runtimes

Runtime	Platform	Notes
Ollama	NVIDIA · Apple Silicon · AMD	Recommended default — native Metal and CUDA, Llama 4, Qwen3, DeepSeek models in model library
llama.cpp	NVIDIA · AMD · CPU	Best for AMD ROCm, CPU-only, and maximum quantization control — supports Qwen3, Llama 4, DeepSeek
LM Studio	NVIDIA · Apple Silicon	GUI-first model browser — exposes OpenAI-compatible server, good for Apple Silicon
vLLM	NVIDIA	Best throughput for concurrent requests — PagedAttention, continuous batching, production serving

Docker

docker compose up -d

This starts NeuralBroker plus supporting observability services (Prometheus/Grafana) defined in docker-compose.yml.

Development

Run locally:

uvicorn src.main:app --reload --host 0.0.0.0 --port 8000

Run tests:

pytest -q

Contributing

Contributions are welcome. If you have bug reports, routing ideas, provider integrations, or docs improvements, open an issue or PR and we will review quickly: GitHub Issues.

MIT License — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.6.7

Apr 25, 2026

0.6.6

Apr 25, 2026

0.6.5

Apr 25, 2026

0.6.4

Apr 25, 2026

0.6.3

Apr 25, 2026

0.6.2

Apr 25, 2026

0.6.1

Apr 25, 2026

0.6.0

Apr 25, 2026

0.5.3

Apr 24, 2026

0.5.2

Apr 24, 2026

0.5.0

Apr 24, 2026

0.4.4

Apr 24, 2026

0.4.3

Apr 24, 2026

0.4.2

Apr 24, 2026

This version

0.4.0

Apr 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

neuralbrok-0.4.0.tar.gz (64.4 kB view details)

Uploaded Apr 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

neuralbrok-0.4.0-py3-none-any.whl (77.4 kB view details)

Uploaded Apr 23, 2026 Python 3

File details

Details for the file neuralbrok-0.4.0.tar.gz.

File metadata

Download URL: neuralbrok-0.4.0.tar.gz
Upload date: Apr 23, 2026
Size: 64.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for neuralbrok-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`9b139bcc907d59c95ee4bf3ea0a74758afdc59ab0f55e766a3e97abc38169f79`
MD5	`81c4c9e62dd2af93c4b64d535132887d`
BLAKE2b-256	`570811145530f6773a44fc9d84b5502defc2a55e4b1db02440df5d5d65b0a2b0`

See more details on using hashes here.

File details

Details for the file neuralbrok-0.4.0-py3-none-any.whl.

File metadata

Download URL: neuralbrok-0.4.0-py3-none-any.whl
Upload date: Apr 23, 2026
Size: 77.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for neuralbrok-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`86d4f4b54bab75b43dbad59f71503c0db9968f9799b6edf7b131b0cff9f0f4ab`
MD5	`09717be0b55fd45335eb2bdb71240ab6`
BLAKE2b-256	`1302ae192b06d918adc60cb103c4a561cb06d50bdc3664ca8677f71e3f663b5d`

See more details on using hashes here.

neuralbrok 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

neuralbroker

VRAM-aware LLM routing daemon · local-first · OpenAI-compatible

Quickstart

Option A — pip install (recommended)

Option B — Install Normally (from source)

Option C — Docker

Prerequisites by platform

What setup does

Supported hardware

How It Works

Routing Modes

cost-mode

speed-mode

fallback-mode

Providers

Pattern A — OpenAI-compatible

Pattern B — Custom translation layer

Local runtimes

Docker

Development

Contributing

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes