VRAM-aware LLM routing daemon — local-first, OpenAI-compatible

Maintainers

khan-ash

NeuralBroker

VRAM-aware LLM routing daemon · local-first · OpenAI-compatible

Your GPU first. Your subscription next. Zero new API keys.

The idea in one sentence

You already pay for Claude Pro. NeuralBroker lets every app on your machine use it — automatically, for free, without ever touching an API key.

pip install neuralbrok
neuralbrok start

That's it. NeuralBroker auto-detects your installed Claude Code OAuth session, ChatGPT auth, Ollama models, and any environment API keys — then presents them as a single OpenAI-compatible endpoint at localhost:8000/v1. Point any tool at that URL and routing is live.

What no one else does

1. Subscription inheritance — turn a $20/month sub into a free API

NeuralBroker reads the OAuth session your Claude Code CLI already holds in ~/.claude/.credentials.json. It shells out to the claude binary to answer requests. No token copying. No API key. Your Claude Pro or Max subscription covers it at zero marginal cost.
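As a rough illustration of what "shelling out to the claude binary" can look like, here is a minimal sketch. It assumes the Claude Code CLI's non-interactive print mode (`claude -p`) and its `--model` flag; the helper names `flatten_messages`, `build_claude_command`, and `ask_claude` are hypothetical and are not taken from NeuralBroker's source, which may invoke the CLI differently.

```python
import shutil
import subprocess


def flatten_messages(messages):
    """Collapse OpenAI-style chat messages into one plain-text prompt."""
    return "\n\n".join(f"{m['role']}: {m['content']}" for m in messages)


def build_claude_command(prompt, model=None):
    """Build the argv for a one-shot, non-interactive Claude Code call."""
    cmd = ["claude", "-p", prompt]  # -p / --print: answer once and exit
    if model:
        cmd += ["--model", model]
    return cmd


def ask_claude(messages, model=None, timeout=120):
    """Run the CLI and return its stdout; the CLI uses its own stored login."""
    if shutil.which("claude") is None:
        raise RuntimeError("claude CLI not found on PATH")
    result = subprocess.run(
        build_claude_command(flatten_messages(messages), model),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout.strip()
```

Because the CLI reads its own credentials, the calling process never handles the OAuth token in this sketch.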

Auto-discovered on startup:
  ✓ Claude PRO subscription    ~/.claude/.credentials.json
  ✓ Ollama (v0.20.4)           localhost:11434
  ✓ ChatGPT subscription       ~/.codex/auth.json  (roadmap)

Any app that speaks OpenAI format now routes through your subscription:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="any-string")
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}]
)
# Routed through your Claude Pro subscription. Cost: $0.

2. VRAM-aware local-first routing — your GPU before the cloud

NeuralBroker polls GPU state every 500 ms. Requests go to your local Ollama models first. Cloud kicks in only when VRAM is under pressure or the local request fails. You pay for electricity, not per-token cloud inference.

Request arrives → check VRAM → qwen2.5:0.5b fits → route local → $0.00001
Request arrives → VRAM full → spill → claude_code subprocess → $0.00000 (Pro sub)
Request arrives → no local match → spill → cheapest cloud provider
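The three flows above can be sketched as a single decision function. This is an illustration under stated assumptions, not NeuralBroker's actual router: `GpuState`, `choose_route`, and the 512 MB `headroom_mb` safety margin are hypothetical, and the GPU numbers would in practice come from the 500 ms poller rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass
class GpuState:
    total_mb: int
    used_mb: int

    @property
    def free_mb(self):
        return self.total_mb - self.used_mb


def choose_route(gpu, model_vram_mb, local_available=True,
                 subscription_available=False, headroom_mb=512):
    """Return 'local', 'subscription', or 'cloud' for one request."""
    # Assumption: keep a small safety margin so the model is not loaded
    # into the last few free megabytes.
    fits = gpu.free_mb >= model_vram_mb + headroom_mb
    if local_available and fits:
        return "local"
    if subscription_available:
        return "subscription"  # e.g. the claude_code subprocess
    return "cloud"             # cheapest paid provider


idle = GpuState(total_mb=8192, used_mb=1024)
busy = GpuState(total_mb=8192, used_mb=8000)
print(choose_route(idle, model_vram_mb=1000))
print(choose_route(busy, model_vram_mb=1000, subscription_available=True))
```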

3. Zero config — discovers everything automatically

No YAML required. On first start, NeuralBroker:

Reads ~/.claude/.credentials.json → registers Claude Pro/Max via subprocess
Reads ~/.codex/auth.json → registers OpenAI or ChatGPT tokens
Scans environment for ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, etc.
Pings localhost:11434 → registers Ollama if running
Pings localhost:8080 → registers llama.cpp if running

Then picks the cheapest path per request. You never write a config file.
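The discovery steps listed above can be approximated in a few lines of standard-library Python. This is a sketch only: the function names, the returned labels, and the 0.25 s probe timeout are assumptions, and the real daemon presumably also parses the credential files instead of just checking that they exist.

```python
import os
import socket
from pathlib import Path

ENV_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GROQ_API_KEY")
LOCAL_PORTS = {"ollama": 11434, "llama.cpp": 8080}


def port_is_open(port, host="127.0.0.1", timeout=0.25):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def discover(home=None, env=None, probe=port_is_open):
    """Return labels for the backends that appear to be available."""
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env
    found = []
    if (home / ".claude" / ".credentials.json").is_file():
        found.append("claude_code")
    if (home / ".codex" / "auth.json").is_file():
        found.append("codex")
    found += [key for key in ENV_KEYS if env.get(key)]
    found += [name for name, port in LOCAL_PORTS.items() if probe(port)]
    return found


if __name__ == "__main__":
    print(discover())
```

Passing `home`, `env`, and `probe` as parameters keeps the function testable without touching the real home directory or network.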

4. Claude Code + Ollama hybrid — one endpoint for your whole dev stack

# NeuralBroker running in the background
neuralbrok start

# Claude Code routes through NB (simple tasks → local, hard tasks → Claude Pro)
ANTHROPIC_BASE_URL=http://localhost:8000/v1 claude

# Cursor, Cline, Codex — same endpoint
# neuralbrok integrations setup cursor
# neuralbrok integrations setup codex

Simple coding tasks → qwen2.5:0.5b locally, instant, free. Complex reasoning → claude-sonnet-4-6 via your Pro subscription, free. Cloud overflow → cheapest available provider.

Quickstart

Option A — pip install

pip install neuralbrok
neuralbrok start

Option B — from source

macOS / Linux:

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
neuralbrok start

Windows (PowerShell):

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
python -m venv .venv; .venv\Scripts\Activate.ps1
pip install -e .
neuralbrok start

Option C — Docker

git clone https://github.com/khan-sha/neuralbroker.git
cd neuralbroker
cp .env.example .env
docker compose up -d

Proxy: http://localhost:8000/v1 · Dashboard: http://localhost:8000/dashboard

Prerequisites

Platform	Required	Notes
macOS · Apple Silicon	Python 3.10+ · Ollama	Metal GPU · automatic unified memory detection
macOS · Intel	Python 3.10+ · Ollama	CPU inference · no VRAM pressure
Linux · NVIDIA	Python 3.10+ · Ollama · CUDA 11.8+	Full VRAM telemetry via pynvml
Linux · AMD	Python 3.10+ · Ollama · ROCm 5.0+	ROCm telemetry · llama.cpp recommended
Linux · CPU	Python 3.10+ · Ollama	CPU fallback · cloud spillover active
Windows · NVIDIA	Python 3.10+ · Ollama · CUDA 11.8+	WSL2 recommended
Windows · CPU	Python 3.10+ · Ollama	CPU fallback
Docker · any	Docker Desktop or Engine	No Python install needed

Ollama: ollama.com/download

Subscription inheritance — detailed

NeuralBroker auto-discovers auth in priority order:

Source	Provider	Type	Cost
`~/.claude/.credentials.json`	Claude Pro/Max via `claude` CLI	OAuth bearer	$0
`~/.codex/auth.json` (tokens)	ChatGPT via Codex	OAuth bearer	$0 (roadmap)
`~/.codex/auth.json` (api_key)	OpenAI	API key	per-token
`ANTHROPIC_API_KEY`	Anthropic API	API key	per-token
`OPENAI_API_KEY`	OpenAI API	API key	per-token
`GROQ_API_KEY`	Groq	API key	per-token
`localhost:11434`	Ollama	none	electricity only
`localhost:8080`	llama.cpp	none	electricity only

Subscription tokens always override env API keys for the same provider. They're effectively free at the margin — the subscription is already paid.
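The override rule can be expressed as a small resolver that keeps one credential per provider. The tuple layout, the `RANK` values, and the `resolve` name are assumptions made for this sketch; only the rule itself (subscription beats API key for the same provider) comes from the text above.

```python
# Lower rank wins. The numeric values are an assumption for this sketch.
RANK = {"subscription": 0, "api_key": 1}


def resolve(sources):
    """Keep one credential per provider, preferring subscriptions.

    `sources` is an iterable of (provider, kind, origin) tuples.
    """
    best = {}
    for provider, kind, origin in sources:
        current = best.get(provider)
        if current is None or RANK[kind] < RANK[current[0]]:
            best[provider] = (kind, origin)
    return best


found = [
    ("anthropic", "api_key", "ANTHROPIC_API_KEY"),
    ("anthropic", "subscription", "~/.claude/.credentials.json"),
    ("groq", "api_key", "GROQ_API_KEY"),
]
print(resolve(found))
```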

Check what was discovered:

curl http://localhost:8000/nb/discovered

Disable auto-discovery:

NB_DISABLE_AUTO_DISCOVERY=1 neuralbrok start

Routing modes

cost (default)

Routes locally when VRAM is free. Otherwise spills to the cloud: subscription first, then the cheapest paid provider.

speed

Always local for minimum latency. Cloud only on local failure.

fallback

Prefers local. Falls back to cloud on an OOM or on repeated errors within 30 s. Resumes local routing once the local backend is healthy again.
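One way to picture the fallback rule is a sliding-window health tracker. The sketch below is an assumption-heavy illustration: the `LocalHealth` class, the threshold of two errors for "repeated", and the choice to treat an OOM as unhealthy for one full window are not documented behavior.

```python
from collections import deque


class LocalHealth:
    """Tracks recent local failures inside a sliding time window."""

    def __init__(self, window_s=30.0, max_errors=2):
        self.window_s = window_s
        self.max_errors = max_errors  # assumed threshold for "repeated"
        self._errors = deque()        # timestamps of ordinary errors
        self._last_oom = None         # timestamp of the latest OOM, if any

    def record_error(self, now, oom=False):
        if oom:
            self._last_oom = now
        else:
            self._errors.append(now)

    def use_local(self, now):
        """False while an OOM or repeated errors fall inside the window."""
        cutoff = now - self.window_s
        # Drop errors that have aged out of the window.
        while self._errors and self._errors[0] <= cutoff:
            self._errors.popleft()
        oom_recent = self._last_oom is not None and self._last_oom > cutoff
        return not oom_recent and len(self._errors) < self.max_errors
```

Timestamps are passed in explicitly (for example from `time.monotonic()`), which keeps the class deterministic in tests.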

smart

Classifies each prompt (code, reasoning, RAG, fast response, long context, tools, chat). Runs SmartModelSelector to score all runnable local models and picks the best match. Falls back to cloud only on failure.

routing:
  default_mode: cost  # cost | speed | fallback | smart

SmartModelSelector

For each request, scores every model that fits in available VRAM:

Signal	Weight
Parameter count	baseline score
Workload capability match	+15 per matching tag
Workload recommended_for match	+20 per matching tag
tok/s on your hardware (>60)	+10
tok/s on your hardware (>30)	+5
tok/s on your hardware (<10)	−10
Long-context request + ctx ≥128k	+25
VRAM headroom (free − model)	×2
MoE model + fast_response workload	+15

Scores are normalized to 0–100%. The top scorer gets the request.
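To make the weight table concrete, here is a sketch of how such a score could be computed. Several details are assumptions because the table does not specify them: the baseline is taken as the parameter count in billions, headroom is measured in GB, and normalization is relative to the top raw score. The `Model` dataclass and the model names `coder-7b` and `big-70b` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    params_b: float          # parameter count in billions (assumed baseline)
    vram_gb: float
    tok_s: float
    ctx: int = 8192
    moe: bool = False
    capabilities: set = field(default_factory=set)
    recommended_for: set = field(default_factory=set)


def raw_score(m, workload_tags, free_vram_gb, long_context=False):
    score = m.params_b
    score += 15 * len(m.capabilities & workload_tags)
    score += 20 * len(m.recommended_for & workload_tags)
    if m.tok_s > 60:
        score += 10
    elif m.tok_s > 30:
        score += 5
    elif m.tok_s < 10:
        score -= 10
    if long_context and m.ctx >= 128_000:
        score += 25
    score += 2 * (free_vram_gb - m.vram_gb)  # VRAM headroom, assumed in GB
    if m.moe and "fast_response" in workload_tags:
        score += 15
    return score


def rank(models, workload_tags, free_vram_gb, long_context=False):
    """Return [(name, percent)] for models that fit in VRAM, best first."""
    fitting = [m for m in models if m.vram_gb <= free_vram_gb]
    raw = {m.name: raw_score(m, workload_tags, free_vram_gb, long_context)
           for m in fitting}
    top = max(raw.values(), default=0)

    def pct(s):
        return max(0.0, 100 * s / top) if top > 0 else 0.0

    return sorted(((n, pct(s)) for n, s in raw.items()),
                  key=lambda item: item[1], reverse=True)


models = [
    Model("qwen2.5:0.5b", 0.5, 1.0, 80, capabilities={"chat"}),
    Model("coder-7b", 7, 6.0, 40, capabilities={"code"}, recommended_for={"code"}),
    Model("big-70b", 70, 48.0, 5),
]
print(rank(models, {"code"}, free_vram_gb=8.0))
```

In this example `big-70b` is excluded because it does not fit in 8 GB, and `coder-7b` ranks first on the strength of its two tag matches.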

API

Endpoint	Description
`POST /v1/chat/completions`	OpenAI-compatible chat completions
`POST /v1/messages`	Anthropic-compatible messages (auto-translated)
`POST /v1/completions`	Text completions
`GET /health`	Status + uptime
`GET /nb/stats`	Routing stats, fallbacks, avg latency
`GET /nb/discovered`	Auto-discovered auth state
`GET /metrics`	Prometheus scrape endpoint
`GET /dashboard`	Live routing dashboard
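The "auto-translated" note on `POST /v1/messages` implies a mapping between the Anthropic Messages request shape and the OpenAI chat shape. A minimal sketch of the request direction follows; the function name is hypothetical, only text content is handled (image and tool blocks are dropped), and NeuralBroker's real translation layer may cover more fields.

```python
def anthropic_to_openai(body):
    """Convert an Anthropic Messages request into an OpenAI chat request."""

    def text_of(content):
        # Anthropic content is either a string or a list of typed blocks.
        if isinstance(content, str):
            return content
        return "".join(b.get("text", "") for b in content if b.get("type") == "text")

    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if body.get("system"):
        messages.append({"role": "system", "content": text_of(body["system"])})
    for msg in body["messages"]:
        messages.append({"role": msg["role"], "content": text_of(msg["content"])})

    out = {"model": body["model"], "messages": messages}
    for key in ("max_tokens", "temperature", "top_p", "stream"):
        if key in body:
            out[key] = body[key]
    if "stop_sequences" in body:
        out["stop"] = body["stop_sequences"]
    return out


request = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 64,
    "system": "Be brief.",
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}],
}
print(anthropic_to_openai(request))
```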

Providers

32 total: 20 Pattern-A (OpenAI-compatible) · 8 Pattern-B (custom adapter) · 4 local runtimes · Claude Code subprocess

Subscription-backed and local (no API key)

Provider	Auth source	Cost
Claude Pro/Max via `claude` CLI	`~/.claude/.credentials.json`	$0
ChatGPT via Codex (roadmap)	`~/.codex/auth.json`	$0
Ollama (all local models)	localhost:11434	electricity
llama.cpp	localhost:8080	electricity

Pattern A — OpenAI-compatible

Provider	Notes
OpenAI	GPT-4o, GPT-4o mini
Groq	Llama 4, Qwen3-32B — fastest via LPU
Together AI	DeepSeek V3, Llama 4, Qwen3-235B
Cerebras	Llama 3.3 70B — wafer-scale throughput
DeepInfra	DeepSeek V3, Qwen3-235B — cheapest per-token
Fireworks AI	Llama 4, DeepSeek V3, Qwen3
Lepton AI	Llama 3.3 70B, Qwen3 — serverless GPU
Novita AI	Qwen3 and DeepSeek at lowest pricing
Hyperbolic	Llama 4, Qwen3-235B — decentralized GPU
Mistral AI	Mistral Small, Large, Codestral
Kimi (Moonshot)	Kimi K2 — 1M token context
DeepSeek	DeepSeek V3, R1 — best price/performance for coding
Qwen (DashScope)	Qwen3-235B-A22B, Qwen3-Coder, QwQ-32B
Yi (01.AI)	Yi-Lightning — strong multilingual
Baichuan	Baichuan4 — Chinese language
Zhipu (GLM-4)	GLM-4
Perplexity	Sonar Pro — live web search built in
AI21 Labs	Jamba 1.5 — SSM-Transformer, long context
OctoAI	Auto-scaling serverless burst
OpenRouter	100+ models — last-resort fallback

Pattern B — Custom translation layer

Provider	Notes
Anthropic (API key)	Claude Sonnet, Haiku — when no Pro sub
Google Gemini	Gemini 1.5 Pro, Flash — 1M token context
Cohere	Command-R+ — enterprise RAG
Replicate	Any open model including fine-tunes
Cloudflare AI	Edge inference, global
AWS Bedrock	Claude, Llama 4, Amazon Nova — data residency
Azure OpenAI	GPT-4o — Microsoft enterprise
Google Vertex	Gemini via GCP — private endpoint

Integrations

23 AI coding agents auto-configured via neuralbrok integrations setup <name>:

Agent	Config	Command
Claude Code	`.claude/settings.json`	`neuralbrok integrations setup claude-code`
Cursor	`.cursor/mcp.json`	`neuralbrok integrations setup cursor`
Cline	`.cline/settings.json`	`neuralbrok integrations setup cline`
GitHub Copilot	`.vscode/settings.json`	`neuralbrok integrations setup github-copilot`
Gemini CLI	`.gemini/settings.json`	`neuralbrok integrations setup gemini-cli`
OpenCode	`opencode.json`	`neuralbrok integrations setup opencode`
Warp	`~/.warp/preferences.yaml`	`neuralbrok integrations setup warp`
Codex	`.env` + `~/.codex/config.json`	`neuralbrok integrations setup codex`
Amp	`~/.amp/config.json`	`neuralbrok integrations setup amp`
Kimi Code	`.kimi/config.json`	`neuralbrok integrations setup kimi-code`
Firebender	`.firebender/config.json`	`neuralbrok integrations setup firebender`
Deep Agents	`.deepagent/config.json`	`neuralbrok integrations setup deep-agents`
Windsurf, Trae, Cursor, Kilo Code, Qwen Code + more	skill files	`neuralbrok integrations setup <name>`

Setup TUI (optional)

neuralbrok setup runs a guided terminal UI for users who want manual configuration:

Detects GPU vendor, model, VRAM (or Apple unified memory)
Profiles every installed Ollama model — VRAM fit, tok/s, compatibility score
Manual mode: select exact models by number from ranked list
Algorithm self-test: shows which model each of 4 sample prompts routes to, and why
Writes ~/.neuralbrok/config.yaml

Roadmap

Shipped

Zero-config subscription inheritance — Claude Pro/Max OAuth auto-discovery, no API key
Claude Code subprocess provider — Pro subscription as a free inference backend
/v1/messages Anthropic endpoint — full wire format translation
/nb/discovered — inspect auto-discovered auth at runtime
VRAM-aware routing — GPU polling, 4 routing modes, cost formula
SmartModelSelector — per-request model scoring
Dashboard v2 — live routing waterfall, mode switching, cost graph
neuralbrok doctor — diagnose config, test connectivity, benchmark local
neuralbrok code — launch Claude Code with NB routing context
23 agent integrations auto-configured via CLI

Phase 2

Subagent decomposition — planner model breaks task into subtasks routed to different models, synthesizer merges
Model council — N models debate a response, moderator merges, disagreement logged as metadata
ChatGPT subscription routing — chatgpt.com OAuth via Codex for $0 GPT-4o
Prompt caching — detect repeated system prompts, route to providers with cache discount
Per-model cost tracking — log token spend per model per day, budget alerts

Phase 3

Dynamic provider weighting — auto-demote slow or error-prone providers
Fine-grained routing rules — route by model name, tag, or regex in config.yaml
Multi-GPU VRAM aggregation — per-GPU model pinning for multi-card setups
Privacy-tier routing — PII detection forces local-only path
Token budget enforcement — daily $ cap auto-downgrades models

Development

uvicorn src.neuralbrok.main:app --reload --host 0.0.0.0 --port 8000
pytest -q

Contributing

Bug reports, routing ideas, provider integrations, docs: GitHub Issues

MIT License — see LICENSE.

Release history

3.0.2

May 15, 2026

3.0.1

May 4, 2026

3.0.0

May 4, 2026

2.1.2

May 1, 2026

2.1.1

May 1, 2026

2.1.0

May 1, 2026

2.0.9

May 1, 2026

2.0.7

May 1, 2026

2.0.6

May 1, 2026

2.0.5

May 1, 2026

2.0.4

May 1, 2026

2.0.0

May 1, 2026

0.9.2

May 15, 2026

0.9.0

May 5, 2026

0.8.3

Apr 29, 2026

0.8.2

Apr 29, 2026

This version

0.8.1

Apr 29, 2026

0.8.0

Apr 29, 2026

0.7.5

Apr 26, 2026

0.7.4

Apr 26, 2026

0.7.3

Apr 25, 2026

0.7.2

Apr 25, 2026

0.7.1

Apr 25, 2026

0.7.0

Apr 25, 2026

0.6.7

Apr 25, 2026

0.6.6

Apr 25, 2026

0.6.5

Apr 25, 2026

0.6.4

Apr 25, 2026

0.6.3

Apr 25, 2026

0.6.2

Apr 25, 2026

0.6.1

Apr 25, 2026

0.6.0

Apr 25, 2026

0.5.3

Apr 24, 2026

0.5.2

Apr 24, 2026

0.5.0

Apr 24, 2026

0.4.4

Apr 24, 2026

0.4.3

Apr 24, 2026

0.4.2

Apr 24, 2026

0.4.0

Apr 23, 2026

Download files

Source Distribution

neuralbrok-0.8.1.tar.gz (118.0 kB)

Uploaded Apr 29, 2026 Source

Built Distribution

neuralbrok-0.8.1-py3-none-any.whl (134.9 kB)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file neuralbrok-0.8.1.tar.gz.

File metadata

Download URL: neuralbrok-0.8.1.tar.gz
Upload date: Apr 29, 2026
Size: 118.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for neuralbrok-0.8.1.tar.gz
Algorithm	Hash digest
SHA256	`1cf645134a38ec7e4b171423fc85ec6b4909ce0352560c467c80ae7d3aa04a64`
MD5	`bee821bcf75acf71b3ce6471f2a7457a`
BLAKE2b-256	`cd20a12916f12544263be62e27c5303128d0a4ed42f4f94aebaeae87eceff36d`

Provenance

The following attestation bundles were made for neuralbrok-0.8.1.tar.gz:

Publisher: pypi-publish.yml on khan-sha/neuralbroker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: neuralbrok-0.8.1.tar.gz
- Subject digest: 1cf645134a38ec7e4b171423fc85ec6b4909ce0352560c467c80ae7d3aa04a64
- Sigstore transparency entry: 1399070435
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: khan-sha/neuralbroker@0d352a0253c7096f119d977cde06f97ad503b820
- Branch / Tag: refs/tags/v0.8.1
- Owner: https://github.com/khan-sha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@0d352a0253c7096f119d977cde06f97ad503b820
- Trigger Event: push

File details

Details for the file neuralbrok-0.8.1-py3-none-any.whl.

File metadata

Download URL: neuralbrok-0.8.1-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 134.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for neuralbrok-0.8.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dea867f9c9fc8bd5dca4875a8b02cf16ef38177aeec971d683622870c130d59d`
MD5	`b2b999c99303fd6d42ae2a61bf8a1887`
BLAKE2b-256	`8cad812393db3c6484a10a74a59726cb20358c5b7aac4f634492069be56ea84d`

Provenance

The following attestation bundles were made for neuralbrok-0.8.1-py3-none-any.whl:

Publisher: pypi-publish.yml on khan-sha/neuralbroker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: neuralbrok-0.8.1-py3-none-any.whl
- Subject digest: dea867f9c9fc8bd5dca4875a8b02cf16ef38177aeec971d683622870c130d59d
- Sigstore transparency entry: 1399070444
- Sigstore integration time: Apr 29, 2026
Source repository:
- Permalink: khan-sha/neuralbroker@0d352a0253c7096f119d977cde06f97ad503b820
- Branch / Tag: refs/tags/v0.8.1
- Owner: https://github.com/khan-sha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@0d352a0253c7096f119d977cde06f97ad503b820
- Trigger Event: push
