NVHive — Multi-LLM orchestration platform with intelligent routing, hive consensus, and auto-agent generation
nvHive
Multi-provider LLM routing that learns from every query.
Why nvHive
Most AI tools use a single provider. When that provider hits rate limits, changes pricing, or goes down, you're stuck.
nvHive routes queries to the best available provider automatically. It tracks which providers actually perform well for which task types, and adjusts routing based on measured quality — not static config.
What makes it different:
- Learns from every query. The router measures real provider performance. By 20 queries it's routing based on data, not guesses.
- Council consensus. 3+ models collaborate and synthesize. Run Nemotron + Gemma 4 locally for a fully private council, or mix local + cloud.
- Confidence-gated escalation. Tries a free model first. Escalates to premium only if the response is uncertain.
- Cross-model verification. A second model independently checks for errors and hallucinations.
Agentic Coding (Beta)
New in 0.10.0 — a multi-model coding agent that plans, executes, and verifies code changes. Scales based on your GPU.
# One-time setup: pulls the right models for your GPU
nvh agent --setup
# Run a coding task
nvh agent "Fix the streaming timeout bug in council.py"
nvh agent "Add unit tests for the auth middleware" --dir ./myproject
nvh agent "Refactor the router to use health-aware selection" -y
How it works: A strong model (cloud or local 70B) plans the task, a local model executes using file and shell tools, then the planner verifies the result. Three phases: plan → execute → verify.
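The three phases can be pictured as a small pipeline. The sketch below is illustrative only: `run_agent`, `AgentResult`, and the three callables are hypothetical stand-ins for the planner and executor models, not nvHive's internal API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentResult:
    steps: List[str]
    outputs: List[str]
    approved: bool


def run_agent(
    task: str,
    plan: Callable[[str], List[str]],          # planner model: task -> steps
    execute: Callable[[str], str],             # executor model: step -> output
    verify: Callable[[str, List[str]], bool],  # planner again: checks the result
) -> AgentResult:
    """Plan -> execute -> verify, with the planner doubling as verifier."""
    steps = plan(task)
    outputs = [execute(step) for step in steps]
    approved = verify(task, outputs)
    return AgentResult(steps, outputs, approved)


# Toy stand-ins so the sketch runs without any model behind it.
result = run_agent(
    "Add unit tests",
    plan=lambda task: [f"locate code for: {task}", f"write changes for: {task}"],
    execute=lambda step: f"done: {step}",
    verify=lambda task, outputs: all(o.startswith("done") for o in outputs),
)
```

In a real run the executor would call file and shell tools; here each phase is a plain function so the control flow is easy to follow.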
Scales with your hardware — 6 tiers from no-GPU to DGX Spark:
| GPU | VRAM | Tier | Models | Mode |
|---|---|---|---|---|
| DGX Spark | 128 GB | Tier 5 | Nemotron 70B + Llama 70B + Qwen 72B (3 models, all local) | Multi |
| RTX 6000 Pro BSE | 96 GB | Tier 4 | Cloud planner + Llama 70B coder + Qwen 32B reviewer (dual local) | Multi |
| A100 / A6000 | 48-80 GB | Tier 3 | Cloud planner + Llama 70B coder (--mode multi for dual local) | Auto |
| RTX 3090 / 4090 | 24 GB | Tier 2 | Cloud planner + Gemma 2 27B coder | Single |
| RTX 4060 Ti | 16 GB | Tier 1 | Cloud planner + Qwen Coder 7B | Single |
| No GPU | — | Tier 0 | Fully cloud | Single |
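As a rough illustration of how the table could translate into a lookup, the sketch below maps detected VRAM to a tier and a default mode. Treating each VRAM figure as a minimum threshold is an assumption of this sketch, and `agent_tier` and `default_mode` are hypothetical names; the table does not say how GPUs between the listed sizes are handled.

```python
from typing import Optional

# Lower VRAM bound (GB) for each tier, read off the table above.
# Treating each figure as a minimum is an assumption of this sketch.
TIER_FLOORS = [(128, 5), (96, 4), (48, 3), (24, 2), (16, 1)]


def agent_tier(vram_gb: Optional[float]) -> int:
    """Return the highest tier whose VRAM floor is met; 0 means fully cloud."""
    if vram_gb is None:
        return 0
    for floor, tier in TIER_FLOORS:
        if vram_gb >= floor:
            return tier
    return 0


def default_mode(tier: int) -> str:
    """Default agent mode per tier, as listed in the table."""
    if tier >= 4:
        return "multi"
    return "auto" if tier == 3 else "single"
```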
nvh agent --setup # pull recommended models
nvh agent --remove # clean up models
nvh agent "task" --mode multi # force multi-model (Tier 3+)
nvh agent "task" --mode single # force single model
nvh agent "task" --git # auto-branch + commit changes
nvh agent "task" --no-quality # skip lint/syntax gates
Multi-model mode (Tier 4-5, or --mode multi on Tier 3): a different model reviews the coder's output, catching bugs that fall in the coder model's own blind spots. Cross-model verification is measurably better than self-review.
Quality gates: after the agent modifies files, ruff lint + syntax checks run automatically. If they fail, the agent gets the errors and fixes them in a feedback loop.
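The feedback loop can be sketched with a syntax gate alone. This is a simplified illustration, not the agent's actual gate: it uses Python's `ast.parse` in place of the ruff step, and `quality_loop`, `toy_fix`, and the three-round limit are assumptions made for the example.

```python
import ast
from typing import Callable, Optional, Tuple


def syntax_error(source: str) -> Optional[str]:
    """Return a short error message if the source does not parse, else None."""
    try:
        ast.parse(source)
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def quality_loop(
    source: str, fix: Callable[[str, str], str], max_rounds: int = 3
) -> Tuple[str, int]:
    """Check, hand errors back to the fixer, and re-check until clean."""
    for round_no in range(max_rounds):
        error = syntax_error(source)
        if error is None:
            return source, round_no
        source = fix(source, error)
    # Rounds exhausted: the caller should re-check the returned source.
    return source, max_rounds


# Toy "agent" that repairs one known mistake: a missing colon.
def toy_fix(source: str, error: str) -> str:
    return source.replace("def f()\n", "def f():\n")


fixed, rounds = quality_loop("def f()\n    return 1\n", toy_fix)
```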
Get Started
pip install nvhive
nvh setup # configure providers (validates keys)
nvh health # check what's available
nvh "your question" # try it
Works immediately with LLM7 (no signup). Run nvh setup to add free providers like Groq and GitHub Models.
NVIDIA GPU Quick Start — local inference on your hardware
# 1. Install Ollama + Nemotron
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nemotron-mini # 4.1GB, runs on 8GB+ VRAM
# 2. Install nvHive
pip install nvhive
# 3. nvHive auto-detects your GPU and Nemotron
nvh nvidia # GPU info + inference stack status
nvh bench # benchmark your GPU (tokens/sec)
# 4. Queries route to your GPU by default
nvh "Explain quicksort" # → local Nemotron, $0, private
nvh safe "Analyze this code" # → forced local, nothing leaves machine
nvh --prefer-nvidia "question" # → 1.3x bonus for NVIDIA providers
# 5. Council on your GPU — 3 models, $0, fully private
nvh convene "Redis vs Postgres for sessions?"
nvHive detects NVIDIA GPUs via pynvml (VRAM, driver, CUDA version, temperature, power draw) and selects the optimal Nemotron model for your hardware. Simple queries stay local. Complex queries escalate to cloud only when needed. The learning loop measures your GPU's quality over time and adjusts routing thresholds automatically.
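For readers curious what pynvml-based detection looks like, here is a minimal sketch. It is not nvHive's detection code; `detect_gpu` is a hypothetical helper that returns `None` when pynvml or an NVIDIA driver is unavailable, and it reads only a subset of the fields mentioned above.

```python
from typing import Optional


def detect_gpu(index: int = 0) -> Optional[dict]:
    """Return basic GPU facts via pynvml, or None if no NVIDIA GPU is usable."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:  # pynvml missing, or no NVIDIA driver present
        return None
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            # Older pynvml releases return bytes here, newer ones return str.
            "name": pynvml.nvmlDeviceGetName(handle),
            "vram_gb": round(memory.total / 1024**3, 1),
            "driver": pynvml.nvmlSystemGetDriverVersion(),
            "temperature_c": pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            ),
            "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000,  # mW -> W
        }
    except Exception:  # a query this GPU does not support
        return None
    finally:
        pynvml.nvmlShutdown()


gpu = detect_gpu()
```

On a machine without an NVIDIA GPU, `gpu` is simply `None`, which is the case where routing would fall back to cloud providers.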
How It Works
Query Pipeline
flowchart TB
USER[User Query] --> CLASSIFY[Task Classifier<br/>TF-IDF · 13 task types]
CLASSIFY --> LOCALCHECK{Local GPU<br/>good enough?}
LOCALCHECK -->|Simple query| GPU[NVIDIA GPU via Ollama<br/>Nemotron + Gemma 4<br/>Two architectures locally]
LOCALCHECK -->|Complex query| SCORE[Score All Providers<br/>capability · cost · latency · health]
SCORE --> ROUTE{Pick Best<br/>Provider}
ROUTE --> FREE[Free Providers<br/>LLM7 · Groq · GitHub]
ROUTE --> PAID[Premium Providers<br/>OpenAI · Anthropic · Google]
ROUTE --> NIM[NVIDIA NIM<br/>Triton]
ROUTE --> GPU
FREE --> RESPONSE[Response]
PAID --> RESPONSE
NIM --> RESPONSE
GPU --> RESPONSE
RESPONSE --> LEARN[Learning Loop<br/>Record outcome · EMA update<br/>Adjusts GPU routing thresholds]
LEARN -->|Feeds back into| SCORE
RESPONSE -->|--verify flag| VERIFY[Cross-Model<br/>Verification]
VERIFY --> FINAL[Verified Response]
RESPONSE --> FINAL
style GPU fill:#76B900,color:#000
style NIM fill:#76B900,color:#000
style LEARN fill:#1a1a2e,color:#76B900,stroke:#76B900
style VERIFY fill:#1a1a2e,color:#00bcd4,stroke:#00bcd4
Task classification: TF-IDF cosine similarity against a 90-example training corpus (13 task types). Queries are matched by weighted term similarity to the training examples, not by hand-written keyword rules.
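To make the classification step concrete, here is a self-contained TF-IDF nearest-example classifier. It is a sketch under stated assumptions: the four-example corpus and its two labels are invented for illustration (the real corpus has 90 examples and 13 task types), and the smoothed IDF formula is one common choice, not necessarily the one nvHive uses.

```python
import math
from collections import Counter

# Tiny illustrative corpus; labels and examples are made up for this sketch.
CORPUS = [
    ("code", "write a python function to sort a list"),
    ("code", "fix the bug in this function"),
    ("math", "solve the equation for x"),
    ("math", "compute the integral of x squared"),
]


def tokenize(text):
    return text.lower().split()


DOC_TOKENS = [tokenize(text) for _, text in CORPUS]
DOC_FREQ = Counter(term for tokens in DOC_TOKENS for term in set(tokens))
N_DOCS = len(CORPUS)


def tfidf(tokens):
    """Map each known term to tf * smoothed idf; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in DOC_FREQ)
    return {
        term: count * (math.log((1 + N_DOCS) / (1 + DOC_FREQ[term])) + 1)
        for term, count in counts.items()
    }


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DOC_VECTORS = [tfidf(tokens) for tokens in DOC_TOKENS]


def classify(query):
    """Return the label of the most similar training example."""
    query_vec = tfidf(tokenize(query))
    scores = [cosine(query_vec, doc_vec) for doc_vec in DOC_VECTORS]
    best = max(range(N_DOCS), key=scores.__getitem__)
    return CORPUS[best][0]
```

Calling `classify("solve this equation")` picks the `math` label because the rare terms "solve" and "equation" carry more weight than the shared word "this".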
Provider scoring: Weighted composite — capability (40%), cost (30%), latency (20%), health (10%). Capability scores start from static estimates and converge to measured performance via exponential moving average.
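The weighting can be written out directly. In this sketch the weights come from the text, while the provider names, the metric values, and the assumption that every metric is normalized to 0..1 with higher meaning better are illustrative.

```python
# Weights as stated in the text.
WEIGHTS = {"capability": 0.40, "cost": 0.30, "latency": 0.20, "health": 0.10}


def composite_score(metrics):
    """Weighted sum of metrics, each assumed normalized to 0..1 (higher is better)."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


# Hypothetical providers with made-up metric values.
providers = {
    "local-gpu": {"capability": 0.6, "cost": 1.0, "latency": 0.9, "health": 1.0},
    "premium-cloud": {"capability": 0.95, "cost": 0.2, "latency": 0.6, "health": 0.9},
}
best = max(providers, key=lambda name: composite_score(providers[name]))
```

With these numbers the local GPU scores 0.82 against 0.65 for the premium provider: its capability is lower, but the cost and latency terms outweigh the gap.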
Adaptive learning: After every query, nvHive records the outcome and updates scores. By 20 queries per provider/task pair, routing is fully data-driven.
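One way to picture the shift from static estimates to measured data is an exponential moving average blended with a prior. The sketch below is an assumption about the shape of the update, not nvHive's implementation: the smoothing factor `ALPHA` and the linear trust ramp are invented, and only the 20-query figure comes from the text.

```python
from dataclasses import dataclass

FULL_TRUST_SAMPLES = 20  # from the text: data-driven by 20 queries per pair
ALPHA = 0.2              # assumed EMA smoothing factor


@dataclass
class LearnedScore:
    static: float      # prior estimate used before any data exists
    ema: float = 0.0   # running average of measured quality
    samples: int = 0

    def record(self, quality: float) -> None:
        """Fold one measured outcome (0..1) into the moving average."""
        if self.samples == 0:
            self.ema = quality
        else:
            self.ema = ALPHA * quality + (1 - ALPHA) * self.ema
        self.samples += 1

    @property
    def value(self) -> float:
        """Blend prior and measurement; weight shifts to data as samples grow."""
        trust = min(self.samples / FULL_TRUST_SAMPLES, 1.0)
        return (1 - trust) * self.static + trust * self.ema


score = LearnedScore(static=0.5)
for _ in range(20):
    score.record(0.9)
```

Before any queries the score equals the static prior (0.5); after 20 consistent measurements it equals the measured value (0.9).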
nvh routing-stats # see learned vs static scores
nvh health # provider resilience dashboard
Failover: If a provider fails, nvHive tries the next in the fallback chain. Every failure feeds back into the health score.
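A fallback chain with health feedback can be sketched in a few lines. The function name, the 0.1 health penalty, and the toy providers are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def call_with_failover(
    query: str,
    chain: List[Tuple[str, Callable[[str], str]]],
    health: Dict[str, float],
    penalty: float = 0.1,  # assumed size of the health-score drop per failure
) -> Tuple[str, str]:
    """Try providers in order; lower the health score of each one that fails."""
    errors = []
    for name, call in chain:
        try:
            return name, call(query)
        except Exception as exc:
            health[name] = max(0.0, health.get(name, 1.0) - penalty)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky(query: str) -> str:
    raise TimeoutError("rate limited")


health: Dict[str, float] = {}
used, answer = call_with_failover(
    "ping", [("primary", flaky), ("backup", lambda q: f"echo {q}")], health
)
```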
Local-first with NVIDIA GPUs: Simple queries route to Nemotron on your NVIDIA GPU via Ollama — no cloud, no cost, no data leaving your machine. GPU detection via pynvml reads VRAM, driver version, and CUDA version to select the optimal local model. The --prefer-nvidia flag gives a 1.3x routing bonus to keep inference on NVIDIA hardware whenever quality allows.
Council Mode
flowchart TB
QUERY[User Query] --> AGENTS[Generate Expert Personas<br/>e.g. Backend Engineer, Architect, DBA]
AGENTS --> M1[Model 1<br/>Groq / Llama]
AGENTS --> M2[Model 2<br/>Google / Gemini]
AGENTS --> M3[Model 3<br/>GitHub / GPT-4o]
M1 --> COLLECT[Collect Responses<br/>Rate-limit staggered]
M2 --> COLLECT
M3 --> COLLECT
COLLECT --> AGREE[Agreement Analysis<br/>Keyword overlap + LLM judge]
AGREE --> SYNTH[Synthesis<br/>Uses non-member provider]
SYNTH --> RESULT[Council Response<br/>+ Confidence Score<br/>+ Individual Perspectives]
style AGREE fill:#1a1a2e,color:#00bcd4,stroke:#00bcd4
style SYNTH fill:#1a1a2e,color:#76B900,stroke:#76B900
When one model isn't enough, nvHive runs the same query through multiple providers in parallel, then synthesizes their responses.
Why this works: Different models have different blind spots. Council mode surfaces all perspectives and synthesizes the best of each.
Confidence scoring: Every council response includes an agreement metric — "3/3 agreed" vs "split decision." Tells you when to trust the consensus.
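The keyword-overlap half of the agreement analysis could look roughly like the sketch below, which averages pairwise Jaccard similarity over the responses. This is an illustrative guess at the technique: the tokenization, the four-character minimum, and the function names are assumptions, and the LLM-judge step is left out entirely.

```python
from itertools import combinations


def keywords(text):
    """Crude keyword set: lowercase words longer than three characters."""
    return {w.strip(".,!?").lower() for w in text.split() if len(w) > 3}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def agreement(responses):
    """Mean pairwise keyword overlap across responses, in 0..1."""
    sets = [keywords(r) for r in responses]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
```

A score near 1.0 corresponds to the "3/3 agreed" case, while a low score suggests a split decision worth reading in full.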
Cost: Council with 3 free providers costs $0. Council with 3 Nemotron variants on a single NVIDIA GPU costs $0 and never leaves your machine. Premium cloud council costs ~3x a single query.
nvh convene "Should we use Redis or Postgres for session storage?"
# → 3 models debate → synthesis with confidence score
Throwdown Mode — Two-Pass Deep Analysis
Throwdown goes beyond council: two expert passes followed by a final synthesis, each building on the last:
flowchart TB
QUERY[User Query] --> A1[Expert 1 - Nemotron<br/>local GPU]
QUERY --> A2[Expert 2 - Gemma 4<br/>local GPU]
QUERY --> A3[Expert 3 - Groq<br/>cloud free]
A1 --> S1[Pass 1 Synthesis]
A2 --> S1
A3 --> S1
S1 --> B1[Expert 1 - Critiques]
S1 --> B2[Expert 2 - Finds blind spots]
S1 --> B3[Expert 3 - Challenges assumptions]
B1 --> S2[Pass 2 Synthesis]
B2 --> S2
B3 --> S2
S2 --> FINAL[Final Answer]
style A1 fill:#1a1a2e,stroke:#76B900,color:#c8c8c8
style A2 fill:#1a1a2e,stroke:#76B900,color:#c8c8c8
style A3 fill:#1a1a2e,stroke:#76B900,color:#c8c8c8
style B1 fill:#1a1a2e,stroke:#00bcd4,color:#c8c8c8
style B2 fill:#1a1a2e,stroke:#00bcd4,color:#c8c8c8
style B3 fill:#1a1a2e,stroke:#00bcd4,color:#c8c8c8
style FINAL fill:#76B900,color:#000
nvh throwdown "Review this architecture for scalability issues"
# Pass 1: 3 experts analyze independently
# Pass 2: experts critique each other's analysis
# Pass 3: final synthesis integrating all perspectives
Why throwdown beats single-model: A single model gives you one perspective, once. Throwdown gives you three perspectives, challenged by three critiques, then synthesized. Errors get caught. Assumptions get questioned. The final answer is more thorough than any single pass.
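The pass structure maps naturally onto `asyncio.gather`. The sketch below is a minimal model of the flow, assuming each expert and the synthesizer are async callables that take a prompt and return text; `throwdown`, `make_expert`, and `join_lines` are hypothetical names, and the toy experts only echo their input.

```python
import asyncio
from typing import Awaitable, Callable, List

Expert = Callable[[str], Awaitable[str]]


async def throwdown(query: str, experts: List[Expert], synthesize: Expert) -> str:
    # Pass 1: every expert answers independently, in parallel.
    first = await asyncio.gather(*(expert(query) for expert in experts))
    draft = await synthesize("\n".join(first))
    # Pass 2: every expert critiques the pass-1 synthesis.
    critiques = await asyncio.gather(
        *(expert(f"Critique this answer: {draft}") for expert in experts)
    )
    # Final synthesis integrates the draft and all critiques.
    return await synthesize("\n".join([draft, *critiques]))


def make_expert(name: str) -> Expert:
    async def expert(prompt: str) -> str:
        return f"[{name}] {prompt[:20]}"
    return expert


async def join_lines(text: str) -> str:
    return f"{len(text.splitlines())} inputs merged"


final = asyncio.run(
    throwdown("Review this design", [make_expert(n) for n in "abc"], join_lines)
)
```

The final synthesis sees four inputs: the first-pass draft plus one critique per expert.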
Smart Query Features
# Confidence-gated escalation: try free first, upgrade only if needed
nvh ask --escalate "Design a distributed lock manager"
# → groq (free, confidence: 42%) → auto-escalated to openai
# Cross-model verification: a second model checks the answer
nvh ask --verify "Is eval() safe in Python?"
# → groq answers → google verifies ✓ (9/10, no issues)
# Both together: cheapest possible verified answer
nvh ask --escalate --verify "Explain the CAP theorem"
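Confidence-gated escalation amounts to walking a cheapest-first ladder until an answer clears a threshold. In this sketch the 0.7 threshold, the `Model` signature, and the canned answers are assumptions; only the 42% confidence figure echoes the example output above.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (answer, confidence 0..1)


def ask_with_escalation(
    query: str, ladder: List[Tuple[str, Model]], threshold: float = 0.7
) -> Tuple[str, str, float]:
    """Try models cheapest-first; stop at the first confident answer."""
    name, answer, confidence = "", "", 0.0
    for name, model in ladder:
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return name, answer, confidence


# Toy models with fixed answers and confidences.
ladder = [
    ("free", lambda q: ("maybe a mutex?", 0.42)),
    ("premium", lambda q: ("use a lease with fencing tokens", 0.91)),
]
provider, answer, confidence = ask_with_escalation(
    "Design a distributed lock manager", ladder
)
```

If no model clears the threshold, the last answer on the ladder is returned along with its confidence, so the caller can still decide what to do with it.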
Local GPU Inference with Nemotron
nvh setup detects your NVIDIA GPU, selects which models fit in your VRAM, and pulls them automatically. Supports both NVIDIA Nemotron and Google Gemma 4 (NVIDIA-optimized) for local council with two different architectures.
nvh setup
# Step 3/3: Local GPU inference
# Detected: NVIDIA GeForce RTX 4090 (24GB VRAM)
# Models: nemotron-small, gemma4:26b
# Pulling nemotron-small... ✓
# Pulling gemma4:26b... ✓
# Local council ready — multiple models for consensus
What nvh setup handles:
flowchart TB
SETUP[nvh setup] --> DETECT[GPU Detection<br/>pynvml reads VRAM · driver · CUDA]
DETECT --> VRAM{Available VRAM?}
VRAM -->|< 6 GB| MINI[nemotron-mini<br/>+ gemma4:e2b]
VRAM -->|6 – 12 GB| SMALL[nemotron-small<br/>+ gemma4:e4b]
VRAM -->|12 – 48 GB| CHOICE{User choice}
VRAM -->|48 GB+| FULL[nemotron 70B<br/>+ gemma4:31b]
CHOICE -->|Both for council| DUAL[nemotron-small<br/>+ gemma4:26b]
CHOICE -->|Single model| SINGLE[nemotron 70B only]
MINI --> CHECK{Ollama running?}
SMALL --> CHECK
DUAL --> CHECK
SINGLE --> CHECK
FULL --> CHECK
CHECK -->|Not installed| INSTALL[Show install command]
CHECK -->|Not running| START[Show: ollama serve]
CHECK -->|Running| PULL[Auto-pull all<br/>models that fit]
PULL --> READY[Ready ✓<br/>Local council enabled]
READY --> ROUTE[nvHive Router<br/>Two model architectures<br/>Learning loop active]
style SMALL fill:#76B900,color:#000
style DUAL fill:#76B900,color:#000
style READY fill:#76B900,color:#000
style ROUTE fill:#1a1a2e,color:#76B900,stroke:#76B900
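The VRAM branches in the flowchart can be expressed as a small function. Model names are taken from the diagram; the function name and the choice to treat each lower bound as inclusive are assumptions, since the diagram's ranges overlap at 6, 12, and 48 GB.

```python
from typing import List


def pick_local_models(vram_gb: float, council: bool = True) -> List[str]:
    """Choose local models by available VRAM, following the flowchart."""
    if vram_gb < 6:
        return ["nemotron-mini", "gemma4:e2b"]
    if vram_gb < 12:
        return ["nemotron-small", "gemma4:e4b"]
    if vram_gb < 48:
        # The 12-48 GB band asks the user: both models, or one larger model.
        return ["nemotron-small", "gemma4:26b"] if council else ["nemotron 70B"]
    return ["nemotron 70B", "gemma4:31b"]
```

For a 24 GB card with the council option, this returns the same pair shown in the sample `nvh setup` output.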
After setup, routing is automatic:
- Simple queries → local Nemotron or Gemma 4 on your GPU (free, private)
- Council mode → both models collaborate locally, catching different blind spots
- Complex queries → cloud providers when local quality isn't sufficient
- nvh bench measures your GPU's actual tok/s with community baselines
- The learning loop measures each model's quality on YOUR hardware
Full GPU detection + VRAM guide
NVIDIA Inference Stack
| Layer | Technology | Hardware | Use Case |
|---|---|---|---|
| Local | Ollama + Nemotron | Consumer GPUs (RTX 3060+) | Default local inference, privacy mode |
| Local | Ollama + Gemma 4 | Consumer GPUs (RTX 3060+) | NVIDIA-optimized, reasoning + multimodal |
| Cloud | NVIDIA NIM API | NVIDIA cloud | Specialized models, 1000 free credits |
| Enterprise | Triton Inference Server | H100 / A100 / L40 | Production multi-model serving, TensorRT-LLM |
| Agent | NemoClaw / OpenShell | Any | Agent orchestration with nvHive routing |
| Detection | pynvml | Any NVIDIA GPU | VRAM, driver, CUDA, temp, power, PCIe |
--prefer-nvidia gives a 1.3x routing bonus to all NVIDIA-backed providers, keeping inference on NVIDIA hardware whenever quality allows.
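A routing bonus of this kind is a multiplier applied before the best provider is picked. The sketch below assumes the bonus multiplies a provider's composite score; the provider names and scores are invented for illustration.

```python
NVIDIA_BONUS = 1.3  # multiplier stated in the text


def apply_preference(scores, nvidia_backed, prefer_nvidia):
    """Return a copy of scores with NVIDIA-backed providers boosted if requested."""
    if not prefer_nvidia:
        return dict(scores)
    return {
        name: score * NVIDIA_BONUS if name in nvidia_backed else score
        for name, score in scores.items()
    }


# Hypothetical composite scores before the bonus.
scores = {"ollama-nemotron": 0.60, "groq": 0.70}
boosted = apply_preference(scores, {"ollama-nemotron"}, prefer_nvidia=True)
winner = max(boosted, key=boosted.get)
```

Here the local provider wins only because of the bonus (0.78 against 0.70); a larger quality gap would still send the query to the other provider.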
Integrations
How nvHive Connects to Your Tools
flowchart LR
subgraph Your Tools
CLI[nvh CLI]
SDK[Python SDK<br/>import nvh]
CC[Claude Code<br/>MCP]
OC[OpenClaw<br/>Agent]
NC[NemoClaw<br/>Agent]
CU[Cursor]
APP[Your App<br/>OpenAI SDK]
end
subgraph nvHive Engine
API[API Server<br/>:8000]
MCP[MCP Server<br/>stdio]
PROXY_OAI[OpenAI Proxy<br/>/v1/proxy]
PROXY_ANT[Anthropic Proxy<br/>/v1/anthropic]
ROUTER[Adaptive Router<br/>+ Learning Loop]
COUNCIL[Council Engine<br/>+ Confidence]
ESCALATE[Escalation<br/>+ Verification]
end
subgraph Providers
GPU[Your GPU<br/>Ollama · Nemotron]
FREE_P[Free Cloud<br/>Groq · GitHub · LLM7<br/>Google · Cerebras]
PAID_P[Paid Cloud<br/>OpenAI · Anthropic<br/>DeepSeek · Mistral]
NIM[NVIDIA NIM<br/>Triton]
end
CLI --> API
SDK --> API
CC --> MCP
OC --> MCP
NC --> PROXY_OAI
CU --> MCP
APP --> PROXY_OAI
APP --> PROXY_ANT
MCP --> API
PROXY_OAI --> API
PROXY_ANT --> API
API --> ROUTER
API --> COUNCIL
API --> ESCALATE
ROUTER --> GPU
ROUTER --> FREE_P
ROUTER --> PAID_P
ROUTER --> NIM
style GPU fill:#76B900,color:#000
style NIM fill:#76B900,color:#000
style ROUTER fill:#1a1a2e,color:#76B900,stroke:#76B900
style COUNCIL fill:#1a1a2e,color:#00bcd4,stroke:#00bcd4
API Proxies — point existing SDKs at nvHive:
| SDK | Configuration |
|---|---|
| Anthropic | ANTHROPIC_BASE_URL=http://localhost:8000/v1/anthropic |
| OpenAI | OPENAI_BASE_URL=http://localhost:8000/v1/proxy |
| Claude Code | claude mcp add nvhive -- python -m nvh.mcp_server |
| Cursor | nvh integrate --auto |
Works With OpenClaw & NemoClaw
nvHive works alongside OpenClaw as a routing layer, and integrates with NemoClaw (NVIDIA's agent framework) as both inference provider and MCP tool server.
nvh migrate --from openclaw # import your existing API keys
nvh nemoclaw --start # start proxy for NemoClaw agents
Note: Anthropic recently changed billing for third-party tools. See the integration guide for details.
For Tool Builders
nvHive is a routing layer. Any AI application can add multi-provider routing:
import asyncio
import nvh

async def main():
    # Drop-in OpenAI-compatible interface
    response = await nvh.complete([
        {"role": "user", "content": "Explain quicksort"}
    ])
    # Inspect routing without executing
    decision = await nvh.route("complex question about databases")
    # Council consensus
    result = await nvh.convene("Architecture review", cabinet="engineering")
    # Provider health check
    status = await nvh.health()

# await needs a running event loop, so the calls live inside a coroutine
asyncio.run(main())
Core Commands
| Command | What It Does |
|---|---|
| nvh "question" | Smart route to best available model |
| nvh convene "question" | Council consensus (3+ models) |
| nvh throwdown "question" | Two-pass deep analysis with critique |
| nvh safe "question" | Local only: nothing leaves your machine |
| nvh ask --escalate | Try free first, escalate if uncertain |
| nvh ask --verify | Cross-model verification |
| nvh health | Provider resilience dashboard |
| nvh why | Explain last routing decision |
| nvh history | Recent queries with costs and timing |
| nvh bench | GPU speed test (tokens/sec) |
| nvh bench -q | Speed + quality comparison |
| nvh routing-stats | Learned vs static routing scores |
| nvh nvidia | NVIDIA GPU infrastructure status |
| nvh migrate | Import keys from OpenClaw / Claude Desktop |
| nvh setup | Interactive provider setup |
Full command reference (50+ commands)
Providers
23 providers. 63 models. 25 free — no credit card required.
| Tier | Providers | Rate Limits |
|---|---|---|
| Free (no signup) | Ollama (local), LLM7 | Unlimited / 30 RPM |
| Free (email signup) | Groq, GitHub Models, Cerebras, SambaNova, Cohere, AI21, SiliconFlow, HuggingFace | 15-30 RPM |
| Free (account) | Google Gemini, Mistral, NVIDIA NIM | 15-1000 RPM |
| Paid | OpenAI, Anthropic, DeepSeek, Fireworks, Together, OpenRouter, Grok | Pay per token |
Benchmark Results
Real data from NVIDIA DGX Spark (GB10, 120GB). Judged by OpenAI with ground truth verification on math prompts.
Quality: Council vs Single Model
| Mode | Accuracy | Completeness | Coherence | Overall | Cost |
|---|---|---|---|---|---|
| Single Model (Nemotron Super) | 5.5 | 5.7 | 5.0 | 5.1 | $0.00 |
| Council (Free: Ollama + Groq + Google) | 9.0 | 8.0 | 9.0 | 8.6 | $0.00 |
Council consensus scored 68% higher than a single model on the same prompts. Ground truth verification on math problems caught errors the single model made that an LLM judge alone wouldn't have flagged.
Speed: Models on DGX Spark
| Model | Size | tok/s |
|---|---|---|
| gemma3 | 3.3 GB | 119.3 |
| nemotron-mini | 2.7 GB | 85.7 |
| gemma4 (e4b) | 9.6 GB | 61.7 |
| llama3.1 | 4.9 GB | 48.2 |
| nemotron-3-super | 86 GB | 23.6 |
Run It Yourself
nvh bench # GPU speed (tokens/sec)
nvh bench -q # speed + quality comparison
nvh health # provider resilience
nvh why # explain last routing decision
nvh estimate --gpu rtx_4090 # predict tok/s on any GPU
16 prompts across code generation, debugging, reasoning, math, creative writing, and Q&A. LLM judge + ground truth verification. Run it yourself. Publish the results.
Learn More
| Guide | Description |
|---|---|
| Getting Started | First-time setup |
| Commands | Full CLI reference (50+ commands) |
| Providers | 23 providers, rate limits, free tiers |
| Council System | Multi-LLM consensus with confidence scoring |
| OpenClaw Integration | Works alongside OpenClaw |
| Claude Code | MCP server setup |
| GPU Detection | Auto-detection, model selection, OOM protection |
| NemoClaw | NVIDIA NemoClaw integration |
| SDK & API | Python SDK, REST API, proxies |
| Architecture | System design and adaptive learning |
License
MIT License. See LICENSE for details.
File details
Details for the file nvhive-0.11.1.tar.gz.
File metadata
- Download URL: nvhive-0.11.1.tar.gz
- Upload date:
- Size: 420.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 01a92039efd1570a8a18824c53a52d8377e94acfe25e96e264d32a74fb33a601 |
| MD5 | 5239e5306e1a2f71f9703aa98130906e |
| BLAKE2b-256 | 87e7a2a44f25d51f16e75fe823ff5d9816fa347db24148f3453d4f947b5e015d |
File details
Details for the file nvhive-0.11.1-py3-none-any.whl.
File metadata
- Download URL: nvhive-0.11.1-py3-none-any.whl
- Upload date:
- Size: 434.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b71daca0e1940d287b4e26a1d7496950421a6859c96065f6cec53981c804fb9b |
| MD5 | 8da90327f540f78b0d4c8712654c0347 |
| BLAKE2b-256 | 3351284b46675828ffd672181d9416eb162824ee3497572c497313f548f69acc |