Skip to main content

VRAM-aware LLM routing daemon — local-first, OpenAI-compatible

Project description

NeuralBroker

NeuralBroker logo

VRAM-aware LLM routing daemon · local-first · OpenAI-compatible

Your GPU first. Your subscription next. Zero new API keys.

Python PyPI License: MIT FastAPI Website


The idea in one sentence

You already pay for Claude Pro. NeuralBroker lets every app on your machine use it — automatically, for free, without ever touching an API key.

pip install neuralbrok
neuralbrok start

That's it. NeuralBroker auto-detects your installed Claude Code OAuth session, ChatGPT auth, Ollama models, and any environment API keys — then presents them as a single OpenAI-compatible endpoint at localhost:8000/v1. Point any tool at that URL and routing is live.


What no one else does

1. Subscription inheritance — turn a $20/month sub into a free API

NeuralBroker reads the OAuth session your Claude Code CLI already holds in ~/.claude/.credentials.json. It shells out to the claude binary to answer requests. No token copying. No API key. Your Claude Pro or Max subscription covers it at zero marginal cost.

Auto-discovered on startup:
  ✓ Claude PRO subscription    ~/.claude/.credentials.json
  ✓ Ollama (v0.20.4)           localhost:11434
  ✓ ChatGPT subscription       ~/.codex/auth.json  (roadmap)

Any app that speaks OpenAI format now routes through your subscription:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="any-string")
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}]
)
# Routed through your Claude Pro subscription. Cost: $0.

2. VRAM-aware local-first routing — your GPU before the cloud

NeuralBroker polls GPU state every 500ms. Requests go to your local Ollama models first. Cloud kicks in only when VRAM is under pressure or local fails. Your electricity bill, not Anthropic's servers.

Request arrives → check VRAM → qwen2.5:0.5b fits → route local → $0.00001
Request arrives → VRAM full → spill → claude_code subprocess → $0.00000 (Pro sub)
Request arrives → no local match → spill → cheapest cloud provider

3. Zero config — discovers everything automatically

No yaml required. On first start NeuralBroker:

  • Reads ~/.claude/.credentials.json → registers Claude Pro/Max via subprocess
  • Reads ~/.codex/auth.json → registers OpenAI or ChatGPT tokens
  • Scans environment for ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, etc.
  • Pings localhost:11434 → registers Ollama if running
  • Pings localhost:8080 → registers llama.cpp if running

Then picks the cheapest path per request. You never write a config file.

4. Claude Code + Ollama hybrid — one endpoint for your whole dev stack

# NeuralBroker running in the background
neuralbrok start

# Claude Code routes through NB (simple tasks → local, hard tasks → Claude Pro)
ANTHROPIC_BASE_URL=http://localhost:8000/v1 claude

# Cursor, Cline, Codex — same endpoint
# neuralbrok integrations setup cursor
# neuralbrok integrations setup codex

Simple coding tasks → qwen2.5:0.5b locally, instant, free. Complex reasoning → claude-sonnet-4-6 via your Pro subscription, free. Cloud overflow → cheapest available provider.


Quickstart

Option A — pip install

pip install neuralbrok
neuralbrok start

Option B — from source

macOS / Linux:

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
neuralbrok start

Windows (PowerShell):

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python -m venv .venv; .venv\Scripts\Activate.ps1
pip install -e .
neuralbrok start

Option C — Docker

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
cp .env.example .env
docker compose up -d

Proxy: http://localhost:8000/v1 · Dashboard: http://localhost:8000/dashboard


Prerequisites

Platform Required Notes
macOS · Apple Silicon Python 3.10+ · Ollama Metal GPU · automatic unified memory detection
macOS · Intel Python 3.10+ · Ollama CPU inference · no VRAM pressure
Linux · NVIDIA Python 3.10+ · Ollama · CUDA 11.8+ Full VRAM telemetry via pynvml
Linux · AMD Python 3.10+ · Ollama · ROCm 5.0+ ROCm telemetry · llama.cpp recommended
Linux · CPU Python 3.10+ · Ollama CPU fallback · cloud spillover active
Windows · NVIDIA Python 3.10+ · Ollama · CUDA 11.8+ WSL2 recommended
Windows · CPU Python 3.10+ · Ollama CPU fallback
Docker · any Docker Desktop or Engine No Python install needed

Ollama: ollama.com/download


Subscription inheritance — detailed

NeuralBroker auto-discovers auth in priority order:

Source Provider Type Cost
~/.claude/.credentials.json Claude Pro/Max via claude CLI OAuth bearer $0
~/.codex/auth.json (tokens) ChatGPT via Codex OAuth bearer $0 (roadmap)
~/.codex/auth.json (api_key) OpenAI API key per-token
ANTHROPIC_API_KEY Anthropic API API key per-token
OPENAI_API_KEY OpenAI API API key per-token
GROQ_API_KEY Groq API key per-token
localhost:11434 Ollama none electricity only
localhost:8080 llama.cpp none electricity only

Subscription tokens always override env API keys for the same provider. They're effectively free at the margin — the subscription is already paid.

Check what was discovered:

curl http://localhost:8000/nb/discovered

Disable auto-discovery:

NB_DISABLE_AUTO_DISCOVERY=1 neuralbrok start

Routing modes

cost (default)

Routes local when VRAM free. Spills to cheapest cloud (subscription first, then cheapest paid).

speed

Always local for minimum latency. Cloud only on local failure.

fallback

Prefer local. Fall back on OOM or repeated error within 30s. Resumes local when healthy.

smart

Classifies each prompt (code, reasoning, RAG, fast response, long context, tools, chat). Runs SmartModelSelector to score all runnable local models. Picks best match. Falls back to cloud only on failure.

routing:
  default_mode: cost  # cost | speed | fallback | smart

SmartModelSelector

Scores every model that fits in available VRAM per request:

Signal Weight
Parameter count baseline score
Workload capability match +15 per matching tag
Workload recommended_for match +20 per matching tag
tok/s on your hardware (>60) +10
tok/s on your hardware (>30) +5
tok/s on your hardware (<10) −10
Long-context request + ctx ≥128k +25
VRAM headroom (free − model) ×2
MoE model + fast_response workload +15

Scores normalized 0–100%. Top scorer routes.


API

Endpoint Description
POST /v1/chat/completions OpenAI-compatible chat completions
POST /v1/messages Anthropic-compatible messages (auto-translated)
POST /v1/completions Text completions
GET /health Status + uptime
GET /nb/stats Routing stats, fallbacks, avg latency
GET /nb/discovered Auto-discovered auth state
GET /metrics Prometheus scrape endpoint
GET /dashboard Live routing dashboard

Providers

32 total: 20 Pattern-A (OpenAI-compatible) · 8 Pattern-B (custom adapter) · 4 local runtimes · Claude Code subprocess

Subscription-backed (no API key)

Provider Auth source Cost
Claude Pro/Max via claude CLI ~/.claude/.credentials.json $0
ChatGPT via Codex (roadmap) ~/.codex/auth.json $0
Ollama (all local models) localhost:11434 electricity
llama.cpp localhost:8080 electricity

Pattern A — OpenAI-compatible

Provider Notes
OpenAI GPT-4o, GPT-4o mini
Groq Llama 4, Qwen3-32B — fastest via LPU
Together AI DeepSeek V3, Llama 4, Qwen3-235B
Cerebras Llama 3.3 70B — wafer-scale throughput
DeepInfra DeepSeek V3, Qwen3-235B — cheapest per-token
Fireworks AI Llama 4, DeepSeek V3, Qwen3
Lepton AI Llama 3.3 70B, Qwen3 — serverless GPU
Novita AI Qwen3 and DeepSeek at lowest pricing
Hyperbolic Llama 4, Qwen3-235B — decentralized GPU
Mistral AI Mistral Small, Large, Codestral
Kimi (Moonshot) Kimi K2 — 1M token context
DeepSeek DeepSeek V3, R1 — best price/performance for coding
Qwen (DashScope) Qwen3-235B-A22B, Qwen3-Coder, QwQ-32B
Yi (01.AI) Yi-Lightning — strong multilingual
Baichuan Baichuan4 — Chinese language
Zhipu (GLM-4) GLM-4
Perplexity Sonar Pro — live web search built in
AI21 Labs Jamba 1.5 — SSM-Transformer, long context
OctoAI Auto-scaling serverless burst
OpenRouter 100+ models — last-resort fallback

Pattern B — Custom translation layer

Provider Notes
Anthropic (API key) Claude Sonnet, Haiku — when no Pro sub
Google Gemini Gemini 1.5 Pro, Flash — 1M token context
Cohere Command-R+ — enterprise RAG
Replicate Any open model including fine-tunes
Cloudflare AI Edge inference, global
AWS Bedrock Claude, Llama 4, Amazon Nova — data residency
Azure OpenAI GPT-4o — Microsoft enterprise
Google Vertex Gemini via GCP — private endpoint

Integrations

23 AI coding agents auto-configured via neuralbrok integrations setup <name>:

Agent Config Command
Claude Code .claude/settings.json neuralbrok integrations setup claude-code
Cursor .cursor/mcp.json neuralbrok integrations setup cursor
Cline .cline/settings.json neuralbrok integrations setup cline
GitHub Copilot .vscode/settings.json neuralbrok integrations setup github-copilot
Gemini CLI .gemini/settings.json neuralbrok integrations setup gemini-cli
OpenCode opencode.json neuralbrok integrations setup opencode
Warp ~/.warp/preferences.yaml neuralbrok integrations setup warp
Codex .env + ~/.codex/config.json neuralbrok integrations setup codex
Amp ~/.amp/config.json neuralbrok integrations setup amp
Kimi Code .kimi/config.json neuralbrok integrations setup kimi-code
Firebender .firebender/config.json neuralbrok integrations setup firebender
Deep Agents .deepagent/config.json neuralbrok integrations setup deep-agents
Windsurf, Trae, Cursor, Kilo Code, Qwen Code + more skill files neuralbrok integrations setup <name>

Setup TUI (optional)

neuralbrok setup runs a guided terminal UI for users who want manual configuration:

  • Detects GPU vendor, model, VRAM (or Apple unified memory)
  • Profiles every installed Ollama model — VRAM fit, tok/s, compatibility score
  • Manual mode: select exact models by number from ranked list
  • Algorithm self-test: shows which model routes to 4 sample prompts and why
  • Writes ~/.neuralbrok/config.yaml

Roadmap

Shipped

  • Zero-config subscription inheritance — Claude Pro/Max OAuth auto-discovery, no API key
  • Claude Code subprocess provider — Pro subscription as a free inference backend
  • /v1/messages Anthropic endpoint — full wire format translation
  • /nb/discovered — inspect auto-discovered auth at runtime
  • VRAM-aware routing — GPU polling, 4 routing modes, cost formula
  • SmartModelSelector — per-request model scoring
  • Dashboard v2 — live routing waterfall, mode switching, cost graph
  • neuralbrok doctor — diagnose config, test connectivity, benchmark local
  • neuralbrok code — launch Claude Code with NB routing context
  • 23 agent integrations auto-configured via CLI

Phase 2

  • Subagent decomposition — planner model breaks task into subtasks routed to different models, synthesizer merges
  • Model council — N models debate a response, moderator merges, disagreement logged as metadata
  • ChatGPT subscription routing — chatgpt.com OAuth via Codex for $0 GPT-4o
  • Prompt caching — detect repeated system prompts, route to providers with cache discount
  • Per-model cost tracking — log token spend per model per day, budget alerts

Phase 3

  • Dynamic provider weighting — auto-demote slow or error-prone providers
  • Fine-grained routing rules — route by model name, tag, or regex in config.yaml
  • Multi-GPU VRAM aggregation — per-GPU model pinning for multi-card setups
  • Privacy-tier routing — PII detection forces local-only path
  • Token budget enforcement — daily $ cap auto-downgrades models

Development

uvicorn src.neuralbrok.main:app --reload --host 0.0.0.0 --port 8000
pytest -q

Contributing

Bug reports, routing ideas, provider integrations, docs: GitHub Issues

MIT License — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

neuralbrok-0.8.1.tar.gz (118.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

neuralbrok-0.8.1-py3-none-any.whl (134.9 kB view details)

Uploaded Python 3

File details

Details for the file neuralbrok-0.8.1.tar.gz.

File metadata

  • Download URL: neuralbrok-0.8.1.tar.gz
  • Upload date:
  • Size: 118.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for neuralbrok-0.8.1.tar.gz
Algorithm Hash digest
SHA256 1cf645134a38ec7e4b171423fc85ec6b4909ce0352560c467c80ae7d3aa04a64
MD5 bee821bcf75acf71b3ce6471f2a7457a
BLAKE2b-256 cd20a12916f12544263be62e27c5303128d0a4ed42f4f94aebaeae87eceff36d

See more details on using hashes here.

Provenance

The following attestation bundles were made for neuralbrok-0.8.1.tar.gz:

Publisher: pypi-publish.yml on khan-sha/neuralbroker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file neuralbrok-0.8.1-py3-none-any.whl.

File metadata

  • Download URL: neuralbrok-0.8.1-py3-none-any.whl
  • Upload date:
  • Size: 134.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for neuralbrok-0.8.1-py3-none-any.whl
Algorithm Hash digest
SHA256 dea867f9c9fc8bd5dca4875a8b02cf16ef38177aeec971d683622870c130d59d
MD5 b2b999c99303fd6d42ae2a61bf8a1887
BLAKE2b-256 8cad812393db3c6484a10a74a59726cb20358c5b7aac4f634492069be56ea84d

See more details on using hashes here.

Provenance

The following attestation bundles were made for neuralbrok-0.8.1-py3-none-any.whl:

Publisher: pypi-publish.yml on khan-sha/neuralbroker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page