
NVHive — Multi-LLM orchestration platform with intelligent routing, hive consensus, and auto-agent generation

nvHive

Multi-provider LLM routing that learns from every query.

Why nvHive

Most AI tools use a single provider. When that provider hits rate limits, changes pricing, or goes down, you're stuck.

nvHive routes queries to the best available provider automatically. It tracks which providers actually perform well for which task types, and adjusts routing based on measured quality — not static config.

What makes it different:

  • Learns from every query. The router measures real provider performance. After about 20 queries per provider/task pair, routing is based on measured data rather than static estimates.
  • Council consensus. 3+ models collaborate and synthesize. Run Nemotron + Gemma 4 locally for a fully private council, or mix local + cloud.
  • Confidence-gated escalation. Tries a free model first. Escalates to premium only if the response is uncertain.
  • Cross-model verification. A second model independently checks for errors and hallucinations.


Get Started

pip install nvhive
nvh setup              # configure providers (validates keys)
nvh health             # check what's available
nvh "your question"    # try it

Works immediately with LLM7 (no signup). Run nvh setup to add free providers like Groq and GitHub Models.

NVIDIA GPU Quick Start — local inference on your hardware
# 1. Install Ollama + Nemotron
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nemotron-mini        # 4.1GB, runs on 8GB+ VRAM

# 2. Install nvHive
pip install nvhive

# 3. nvHive auto-detects your GPU and Nemotron
nvh nvidia                       # GPU info + inference stack status
nvh bench                        # benchmark your GPU (tokens/sec)

# 4. Queries route to your GPU by default
nvh "Explain quicksort"          # → local Nemotron, $0, private
nvh safe "Analyze this code"     # → forced local, nothing leaves machine
nvh --prefer-nvidia "question"   # → 1.3x bonus for NVIDIA providers

# 5. Council on your GPU — 3 models, $0, fully private
nvh convene "Redis vs Postgres for sessions?"

nvHive detects NVIDIA GPUs via pynvml (VRAM, driver, CUDA version, temperature, power draw) and selects the optimal Nemotron model for your hardware. Simple queries stay local. Complex queries escalate to cloud only when needed. The learning loop measures your GPU's quality over time and adjusts routing thresholds automatically.
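
The detection step described above can be approximated with a few NVML calls. This is a minimal sketch, not nvHive's own code: the function name `detect_gpus` and the returned dictionary keys are illustrative assumptions, and the function returns an empty list whenever `pynvml` or an NVIDIA driver is missing.

```python
from typing import Dict, List


def detect_gpus() -> List[Dict[str, object]]:
    """Return one dict per NVIDIA GPU, or [] if NVML is unavailable."""
    try:
        import pynvml  # NVML bindings; optional dependency
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no driver or no NVIDIA GPU on this machine
    try:
        driver = pynvml.nvmlSystemGetDriverVersion()
        cuda = pynvml.nvmlSystemGetCudaDriverVersion()  # e.g. 12020 for CUDA 12.2
        gpus = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpus.append({
                "name": pynvml.nvmlDeviceGetName(handle),  # bytes on older bindings
                "vram_gb": round(memory.total / 1024**3, 1),
                "driver": driver,
                "cuda": f"{cuda // 1000}.{(cuda % 1000) // 10}",
                "temp_c": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),
                # NVML reports power draw in milliwatts.
                "power_w": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000,
            })
        return gpus
    except pynvml.NVMLError:
        return []  # a query is unsupported on this GPU; give up in this sketch
    finally:
        pynvml.nvmlShutdown()


for gpu in detect_gpus():
    print(gpu)
```

On a machine without an NVIDIA GPU the loop simply prints nothing, which mirrors the fall-through to cloud providers.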


How It Works

Query Pipeline

flowchart TB
    USER[User Query] --> CLASSIFY[Task Classifier<br/>TF-IDF · 13 task types]
    CLASSIFY --> LOCALCHECK{Local GPU<br/>good enough?}
    
    LOCALCHECK -->|Simple query| GPU[NVIDIA GPU via Ollama<br/>Nemotron + Gemma 4<br/>Two architectures locally]
    LOCALCHECK -->|Complex query| SCORE[Score All Providers<br/>capability · cost · latency · health]
    
    SCORE --> ROUTE{Pick Best<br/>Provider}
    
    ROUTE --> FREE[Free Providers<br/>LLM7 · Groq · GitHub]
    ROUTE --> PAID[Premium Providers<br/>OpenAI · Anthropic · Google]
    ROUTE --> NIM[NVIDIA NIM<br/>Triton]
    ROUTE --> GPU
    
    FREE --> RESPONSE[Response]
    PAID --> RESPONSE
    NIM --> RESPONSE
    GPU --> RESPONSE
    
    RESPONSE --> LEARN[Learning Loop<br/>Record outcome · EMA update<br/>Adjusts GPU routing thresholds]
    LEARN -->|Feeds back into| SCORE
    
    RESPONSE -->|--verify flag| VERIFY[Cross-Model<br/>Verification]
    VERIFY --> FINAL[Verified Response]
    RESPONSE --> FINAL
    
    style GPU fill:#76B900,color:#000
    style NIM fill:#76B900,color:#000
    style LEARN fill:#1a1a2e,color:#76B900,stroke:#76B900
    style VERIFY fill:#1a1a2e,color:#00bcd4,stroke:#00bcd4

Task classification: TF-IDF cosine similarity against a 90-example training corpus covering 13 task types. Queries are matched by weighted term similarity to labeled examples rather than by fixed keyword rules.
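
To make the classification step concrete, here is a small standard-library sketch of TF-IDF plus cosine similarity. The six-example corpus, the task labels, and the smoothed IDF formula are assumptions chosen for illustration; nvHive's actual corpus and weighting are not shown in this README.

```python
import math
from collections import Counter

# Hypothetical miniature corpus of (example query, task type) pairs.
CORPUS = [
    ("write a python function to reverse a list", "code_generation"),
    ("implement a binary search function in go", "code_generation"),
    ("why does this function raise a null pointer error", "debugging"),
    ("fix the crash in this stack trace", "debugging"),
    ("solve this integral and show the steps", "math"),
    ("what is the derivative of x squared", "math"),
]


def tokenize(text):
    return text.lower().split()


DOCS = [tokenize(text) for text, _ in CORPUS]
N = len(DOCS)
# Document frequency: in how many examples each term appears.
DF = Counter(term for doc in DOCS for term in set(doc))


def tfidf(tokens):
    """Term count times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {t: c * (math.log((1 + N) / (1 + DF[t])) + 1.0) for t, c in counts.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


VECTORS = [tfidf(doc) for doc in DOCS]


def classify(query):
    """Return the task type of the most similar corpus example."""
    q = tfidf(tokenize(query))
    best = max(range(N), key=lambda i: cosine(q, VECTORS[i]))
    return CORPUS[best][1]


print(classify("fix this crash"))
print(classify("write a function in python"))
```

Nearest-example lookup keeps the sketch short; averaging similarity per task type would be an equally plausible design.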

Provider scoring: Weighted composite — capability (40%), cost (30%), latency (20%), health (10%). Capability scores start from static estimates and converge to measured performance via exponential moving average.
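
The weighted composite can be written down directly from the percentages above. In this sketch the provider names and metric values are invented for illustration, and each metric is assumed to be pre-normalized to the range 0 to 1 with higher meaning better (so a cheap provider has a high cost score).

```python
# Weights taken from the description above.
WEIGHTS = {"capability": 0.40, "cost": 0.30, "latency": 0.20, "health": 0.10}


def composite_score(metrics):
    """Weighted sum of normalized metrics (each assumed to be in [0, 1])."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


# Hypothetical providers and metric values.
providers = {
    "local_gpu": {"capability": 0.60, "cost": 1.00, "latency": 0.80, "health": 1.00},
    "premium_cloud": {"capability": 0.95, "cost": 0.20, "latency": 0.60, "health": 0.90},
}

ranked = sorted(providers, key=lambda name: composite_score(providers[name]), reverse=True)
for name in ranked:
    print(f"{name}: {composite_score(providers[name]):.2f}")
```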

Adaptive learning: After every query, nvHive records the outcome and updates scores. By 20 queries per provider/task pair, routing is fully data-driven.

nvh routing-stats    # see learned vs static scores
nvh health           # provider resilience dashboard
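
One way to picture how a static estimate converges to measured performance is an exponential moving average blended with a warm-up weight. This is a sketch under assumptions: the smoothing factor `alpha`, the linear blend, and the class name `LearnedScore` are illustrative; only the EMA idea and the 20-query figure come from the text above.

```python
from dataclasses import dataclass


@dataclass
class LearnedScore:
    static: float        # initial estimate before any data exists
    alpha: float = 0.2   # EMA smoothing factor (assumed value)
    warmup: int = 20     # queries until the score is fully data-driven
    ema: float = 0.0
    n: int = 0

    def record(self, quality: float) -> None:
        """Fold one measured outcome (0..1) into the moving average."""
        if self.n == 0:
            self.ema = quality
        else:
            self.ema = self.alpha * quality + (1 - self.alpha) * self.ema
        self.n += 1

    @property
    def value(self) -> float:
        """Blend static and measured scores; measured wins after warm-up."""
        trust = min(self.n / self.warmup, 1.0)
        return trust * self.ema + (1 - trust) * self.static


score = LearnedScore(static=0.5)
for _ in range(20):
    score.record(0.9)
print(round(score.value, 3))
```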

Failover: If a provider fails, nvHive tries the next in the fallback chain. Every failure feeds back into the health score.
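
A fallback chain of this kind can be sketched in a few lines. The function name, the halving penalty, and the stub providers below are hypothetical; the README only states that failures move to the next provider and feed the health score.

```python
from typing import Callable, Dict, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def call_with_failover(
    prompt: str,
    chain: List[Provider],
    health: Dict[str, float],
    penalty: float = 0.5,  # assumed: each failure halves the health score
) -> Tuple[str, str]:
    """Try providers in order; return (provider name, answer) from the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:
            health[name] = health.get(name, 1.0) * penalty
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def flaky(prompt: str) -> str:
    raise TimeoutError("rate limited")


def steady(prompt: str) -> str:
    return "ok: " + prompt


health: Dict[str, float] = {}
used, answer = call_with_failover("hi", [("groq", flaky), ("llm7", steady)], health)
print(used, answer, health)
```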

Local-first with NVIDIA GPUs: Simple queries route to Nemotron on your NVIDIA GPU via Ollama — no cloud, no cost, no data leaving your machine. GPU detection via pynvml reads VRAM, driver version, and CUDA version to select the optimal local model. The --prefer-nvidia flag gives a 1.3x routing bonus to keep inference on NVIDIA hardware whenever quality allows.


Council Mode

flowchart TB
    QUERY[User Query] --> AGENTS[Generate Expert Personas<br/>e.g. Backend Engineer, Architect, DBA]
    
    AGENTS --> M1[Model 1<br/>Groq / Llama]
    AGENTS --> M2[Model 2<br/>Google / Gemini]
    AGENTS --> M3[Model 3<br/>GitHub / GPT-4o]
    
    M1 --> COLLECT[Collect Responses<br/>Rate-limit staggered]
    M2 --> COLLECT
    M3 --> COLLECT
    
    COLLECT --> AGREE[Agreement Analysis<br/>Keyword overlap + LLM judge]
    AGREE --> SYNTH[Synthesis<br/>Uses non-member provider]
    
    SYNTH --> RESULT[Council Response<br/>+ Confidence Score<br/>+ Individual Perspectives]
    
    style AGREE fill:#1a1a2e,color:#00bcd4,stroke:#00bcd4
    style SYNTH fill:#1a1a2e,color:#76B900,stroke:#76B900

When one model isn't enough, nvHive runs the same query through multiple providers in parallel, then synthesizes their responses.
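
The parallel fan-out with staggered starts might look like the following `asyncio` sketch. The `ask` coroutine is a stub and the stagger interval is an assumed value; a real implementation would call provider clients instead.

```python
import asyncio
from typing import List


async def ask(model: str, prompt: str, delay: float) -> str:
    """Stub model call; the delay staggers request start times."""
    await asyncio.sleep(delay)
    return f"{model}: answer to {prompt!r}"


async def convene(models: List[str], prompt: str, stagger: float = 0.01) -> List[str]:
    """Query every model concurrently and collect answers in input order."""
    tasks = [ask(model, prompt, index * stagger) for index, model in enumerate(models)]
    return await asyncio.gather(*tasks)


answers = asyncio.run(convene(["llama", "gemini", "gpt-4o"], "Redis or Postgres?"))
for line in answers:
    print(line)
```

`asyncio.gather` returns results in the order the tasks were passed in, so each answer can be attributed to its model without extra bookkeeping.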

Why this works: Different models have different blind spots. Council mode surfaces all perspectives and synthesizes the best of each.

Confidence scoring: Every council response includes an agreement metric — "3/3 agreed" vs "split decision." Tells you when to trust the consensus.
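
The keyword-overlap half of the agreement analysis can be illustrated with Jaccard similarity between responses. Everything specific here is an assumption made for the sketch: the stopword list, the 0.25 threshold, and the rule that a response "agrees" when its mean overlap with the others clears the threshold. The LLM-judge half is not modeled.

```python
from typing import List, Set, Tuple

STOPWORDS = {"a", "an", "the", "for", "is", "with", "of", "to", "and"}


def keywords(text: str) -> Set[str]:
    words = {word.strip(".,!?").lower() for word in text.split()}
    return words - STOPWORDS - {""}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def agreement(responses: List[str], threshold: float = 0.25) -> Tuple[int, str]:
    """Count responses whose mean keyword overlap with the others clears the threshold."""
    sets = [keywords(r) for r in responses]
    agreeing = 0
    for i, current in enumerate(sets):
        overlaps = [jaccard(current, other) for j, other in enumerate(sets) if j != i]
        if overlaps and sum(overlaps) / len(overlaps) >= threshold:
            agreeing += 1
    total = len(sets)
    label = f"{agreeing}/{total} agreed" if agreeing == total else "split decision"
    return agreeing, label


unanimous = agreement([
    "Redis for session storage",
    "Use Redis for session storage",
    "Redis session storage is best",
])
split = agreement([
    "Use Redis for session storage",
    "Redis is best for session storage",
    "Choose Postgres with a sessions table",
])
print(unanimous, split)
```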

Cost: Council with 3 free providers costs $0. Council with 3 Nemotron variants on a single NVIDIA GPU costs $0 and never leaves your machine. Premium cloud council costs ~3x a single query.

nvh convene "Should we use Redis or Postgres for session storage?"
# → 3 models debate → synthesis with confidence score

Throwdown Mode — Two-Pass Deep Analysis

Throwdown goes beyond council. Two expert passes (independent analysis, then critique), each followed by a synthesis that builds on the previous stage:

flowchart TB
    QUERY[User Query] --> A1[Expert 1 - Nemotron<br/>local GPU]
    QUERY --> A2[Expert 2 - Gemma 4<br/>local GPU]
    QUERY --> A3[Expert 3 - Groq<br/>cloud free]
    
    A1 --> S1[Pass 1 Synthesis]
    A2 --> S1
    A3 --> S1
    
    S1 --> B1[Expert 1 - Critiques]
    S1 --> B2[Expert 2 - Finds blind spots]
    S1 --> B3[Expert 3 - Challenges assumptions]
    
    B1 --> S2[Pass 2 Synthesis]
    B2 --> S2
    B3 --> S2
    
    S2 --> FINAL[Final Answer]
    
    style A1 fill:#1a1a2e,stroke:#76B900,color:#c8c8c8
    style A2 fill:#1a1a2e,stroke:#76B900,color:#c8c8c8
    style A3 fill:#1a1a2e,stroke:#76B900,color:#c8c8c8
    style B1 fill:#1a1a2e,stroke:#00bcd4,color:#c8c8c8
    style B2 fill:#1a1a2e,stroke:#00bcd4,color:#c8c8c8
    style B3 fill:#1a1a2e,stroke:#00bcd4,color:#c8c8c8
    style FINAL fill:#76B900,color:#000

nvh throwdown "Review this architecture for scalability issues"
# Pass 1: 3 experts analyze independently, then a first synthesis
# Pass 2: experts critique the Pass 1 synthesis
# Final: synthesis integrating all perspectives

Why throwdown beats single-model: A single model gives you one perspective, once. Throwdown gives you three perspectives, challenged by three critiques, then synthesized. Errors get caught. Assumptions get questioned. The final answer is more thorough than any single pass.
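
The control flow in the diagram can be sketched as a short function. The expert and synthesizer callables are stubs, and the critique prompt wording is an assumption; the sketch only shows the order of calls (three analyses, a synthesis, three critiques, a final synthesis).

```python
from typing import Callable, List

Expert = Callable[[str], str]


def throwdown(query: str, experts: List[Expert], synthesize: Callable[[List[str]], str]) -> str:
    # Pass 1: every expert analyzes the query independently.
    analyses = [expert(query) for expert in experts]
    draft = synthesize(analyses)

    # Pass 2: every expert critiques the first synthesis.
    critique_prompt = f"Question: {query}\nDraft answer: {draft}\nCritique this draft."
    critiques = [expert(critique_prompt) for expert in experts]

    # Final synthesis folds the critiques back into the draft.
    return synthesize([draft, *critiques])


# Stub experts standing in for real model calls.
calls: List[str] = []


def make_expert(name: str) -> Expert:
    def expert(prompt: str) -> str:
        calls.append(name)
        return f"{name} reviewed {len(prompt)} chars"
    return expert


final = throwdown(
    "Review this architecture",
    [make_expert("nemotron"), make_expert("gemma"), make_expert("groq")],
    synthesize=lambda parts: " / ".join(parts),
)
print(final)
```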


Smart Query Features

# Confidence-gated escalation: try free first, upgrade only if needed
nvh ask --escalate "Design a distributed lock manager"
# → groq (free, confidence: 42%) → auto-escalated to openai

# Cross-model verification: a second model checks the answer
nvh ask --verify "Is eval() safe in Python?"
# → groq answers → google verifies ✓ (9/10, no issues)

# Both together: cheapest possible verified answer
nvh ask --escalate --verify "Explain the CAP theorem"
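
Confidence-gated escalation reduces to a loop over providers ordered from cheapest to most expensive. In this sketch the 0.7 threshold, the `(answer, confidence)` return shape, and the stub providers are assumptions; how nvHive actually estimates confidence is not described here.

```python
from typing import Callable, List, Tuple

Tier = Tuple[str, Callable[[str], Tuple[str, float]]]


def ask_with_escalation(
    prompt: str, tiers: List[Tier], threshold: float = 0.7
) -> Tuple[str, List[Tuple[str, float]]]:
    """Try tiers cheapest first; stop at the first confident answer."""
    if not tiers:
        raise ValueError("at least one provider tier is required")
    trail = []
    answer = ""
    for name, call in tiers:
        answer, confidence = call(prompt)
        trail.append((name, confidence))
        if confidence >= threshold:
            break
    return answer, trail


# Stub providers returning (answer, self-reported confidence).
def free_model(prompt: str) -> Tuple[str, float]:
    return "rough answer", 0.42


def premium_model(prompt: str) -> Tuple[str, float]:
    return "detailed answer", 0.90


answer, trail = ask_with_escalation(
    "Design a distributed lock manager",
    [("groq", free_model), ("openai", premium_model)],
)
print(answer, trail)
```

If no tier clears the threshold, the sketch returns the last (most capable) answer along with the full trail, so the caller can still see that confidence stayed low.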

Local GPU Inference with Nemotron

nvh setup detects your NVIDIA GPU, selects which models fit in your VRAM, and pulls them automatically. Supports both NVIDIA Nemotron and Google Gemma 4 (NVIDIA-optimized) for local council with two different architectures.

nvh setup
# Step 3/3: Local GPU inference
#   Detected: NVIDIA GeForce RTX 4090 (24GB VRAM)
#   Models: nemotron-small, gemma4:26b
#   Pulling nemotron-small... ✓
#   Pulling gemma4:26b... ✓
#   Local council ready — multiple models for consensus

What nvh setup handles:

flowchart TB
    SETUP[nvh setup] --> DETECT[GPU Detection<br/>pynvml reads VRAM · driver · CUDA]
    
    DETECT --> VRAM{Available VRAM?}
    
    VRAM -->|< 6 GB| MINI[nemotron-mini<br/>+ gemma4:e2b]
    VRAM -->|6 – 12 GB| SMALL[nemotron-small<br/>+ gemma4:e4b]
    VRAM -->|12 – 48 GB| CHOICE{User choice}
    VRAM -->|48 GB+| FULL[nemotron 70B<br/>+ gemma4:31b]

    CHOICE -->|Both for council| DUAL[nemotron-small<br/>+ gemma4:26b]
    CHOICE -->|Single model| SINGLE[nemotron 70B only]

    MINI --> CHECK{Ollama running?}
    SMALL --> CHECK
    DUAL --> CHECK
    SINGLE --> CHECK
    FULL --> CHECK
    
    CHECK -->|Not installed| INSTALL[Show install command]
    CHECK -->|Not running| START[Show: ollama serve]
    CHECK -->|Running| PULL[Auto-pull all<br/>models that fit]
    
    PULL --> READY[Ready ✓<br/>Local council enabled]
    
    READY --> ROUTE[nvHive Router<br/>Two model architectures<br/>Learning loop active]
    
    style SMALL fill:#76B900,color:#000
    style DUAL fill:#76B900,color:#000
    style READY fill:#76B900,color:#000
    style ROUTE fill:#1a1a2e,color:#76B900,stroke:#76B900
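
The VRAM tiers in the flowchart map naturally onto a lookup function. The tier contents are copied from the diagram; how the boundary values (exactly 6, 12, or 48 GB) are assigned is an assumption, as is the function name `select_models`.

```python
from typing import List


def select_models(vram_gb: float, council: bool = True) -> List[str]:
    """Pick local models for the available VRAM, following the flowchart tiers."""
    if vram_gb < 6:
        return ["nemotron-mini", "gemma4:e2b"]
    if vram_gb < 12:
        return ["nemotron-small", "gemma4:e4b"]
    if vram_gb < 48:
        # Mid-range cards: two models for council, or one larger model.
        return ["nemotron-small", "gemma4:26b"] if council else ["nemotron 70B"]
    return ["nemotron 70B", "gemma4:31b"]


for vram in (4, 8, 24, 80):
    print(vram, select_models(vram))
```

For a 24 GB card with council enabled this yields the same pair shown in the `nvh setup` output earlier in this section.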

After setup, routing is automatic:

  • Simple queries → local Nemotron or Gemma 4 on your GPU (free, private)
  • Council mode → both models collaborate locally, catching different blind spots
  • Complex queries → cloud providers when local quality isn't sufficient
  • nvh bench measures your GPU's actual tok/s with community baselines
  • The learning loop measures each model's quality on YOUR hardware

Full GPU detection + VRAM guide

NVIDIA Inference Stack

Layer | Technology | Hardware | Use Case
Local | Ollama + Nemotron | Consumer GPUs (RTX 3060+) | Default local inference, privacy mode
Local | Ollama + Gemma 4 | Consumer GPUs (RTX 3060+) | NVIDIA-optimized, reasoning + multimodal
Cloud | NVIDIA NIM API | NVIDIA cloud | Specialized models, 1000 free credits
Enterprise | Triton Inference Server | H100 / A100 / L40 | Production multi-model serving, TensorRT-LLM
Agent | NemoClaw / OpenShell | Any | Agent orchestration with nvHive routing
Detection | pynvml | Any NVIDIA GPU | VRAM, driver, CUDA, temp, power, PCIe

--prefer-nvidia gives a 1.3x routing bonus to all NVIDIA-backed providers, keeping inference on NVIDIA hardware whenever quality allows.
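
A multiplicative bonus tilts the ranking without overriding large quality gaps, which is one reading of "whenever quality allows". The provider identifiers and base scores below are hypothetical; only the 1.3x factor comes from the text.

```python
from typing import Dict

# Hypothetical identifiers for providers that run on NVIDIA hardware.
NVIDIA_BACKED = {"ollama", "nvidia_nim", "triton"}


def apply_bonus(scores: Dict[str, float], prefer_nvidia: bool, bonus: float = 1.3) -> Dict[str, float]:
    """Multiply NVIDIA-backed provider scores by the bonus when the flag is set."""
    return {
        name: score * bonus if prefer_nvidia and name in NVIDIA_BACKED else score
        for name, score in scores.items()
    }


def pick(scores: Dict[str, float]) -> str:
    return max(scores, key=scores.get)


close_call = {"ollama": 0.60, "openai": 0.70}
wide_gap = {"ollama": 0.40, "openai": 0.70}

print(pick(apply_bonus(close_call, prefer_nvidia=False)))
print(pick(apply_bonus(close_call, prefer_nvidia=True)))
print(pick(apply_bonus(wide_gap, prefer_nvidia=True)))
```

With a close call the bonus flips the choice to the local provider; with a wide gap the cloud provider still wins.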


Integrations

How nvHive Connects to Your Tools

flowchart LR
    subgraph Your Tools
        CLI[nvh CLI]
        SDK[Python SDK<br/>import nvh]
        CC[Claude Code<br/>MCP]
        OC[OpenClaw<br/>Agent]
        NC[NemoClaw<br/>Agent]
        CU[Cursor]
        APP[Your App<br/>OpenAI SDK]
    end

    subgraph nvHive Engine
        API[API Server<br/>:8000]
        MCP[MCP Server<br/>stdio]
        PROXY_OAI[OpenAI Proxy<br/>/v1/proxy]
        PROXY_ANT[Anthropic Proxy<br/>/v1/anthropic]
        ROUTER[Adaptive Router<br/>+ Learning Loop]
        COUNCIL[Council Engine<br/>+ Confidence]
        ESCALATE[Escalation<br/>+ Verification]
    end

    subgraph Providers
        GPU[Your GPU<br/>Ollama · Nemotron]
        FREE_P[Free Cloud<br/>Groq · GitHub · LLM7<br/>Google · Cerebras]
        PAID_P[Paid Cloud<br/>OpenAI · Anthropic<br/>DeepSeek · Mistral]
        NIM[NVIDIA NIM<br/>Triton]
    end

    CLI --> API
    SDK --> API
    CC --> MCP
    OC --> MCP
    NC --> PROXY_OAI
    CU --> MCP
    APP --> PROXY_OAI
    APP --> PROXY_ANT

    MCP --> API
    PROXY_OAI --> API
    PROXY_ANT --> API
    API --> ROUTER
    API --> COUNCIL
    API --> ESCALATE
    ROUTER --> GPU
    ROUTER --> FREE_P
    ROUTER --> PAID_P
    ROUTER --> NIM

    style GPU fill:#76B900,color:#000
    style NIM fill:#76B900,color:#000
    style ROUTER fill:#1a1a2e,color:#76B900,stroke:#76B900
    style COUNCIL fill:#1a1a2e,color:#00bcd4,stroke:#00bcd4

API Proxies — point existing SDKs at nvHive:

SDK | Configuration
Anthropic | ANTHROPIC_BASE_URL=http://localhost:8000/v1/anthropic
OpenAI | OPENAI_BASE_URL=http://localhost:8000/v1/proxy
Claude Code | claude mcp add nvhive -- python -m nvh.mcp_server
Cursor | nvh integrate --auto
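
To show what pointing an OpenAI-style client at the proxy amounts to on the wire, the sketch below builds (but does not send) the HTTP request using only the standard library. OpenAI-compatible clients append `/chat/completions` to the base URL; the model name `"auto"` is a placeholder assumption, and whether the proxy expects an API key header is not stated in this README.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000/v1/proxy",
    model: str = "auto",  # placeholder; the accepted model names are an assumption
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the nvHive proxy."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Explain quicksort")
print(request.get_method(), request.full_url)
# Sending requires a running nvHive API server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```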

Works With OpenClaw & NemoClaw

nvHive works alongside OpenClaw as a routing layer, and integrates with NemoClaw (NVIDIA's agent framework) as both inference provider and MCP tool server.

nvh migrate --from openclaw    # import your existing API keys
nvh nemoclaw --start           # start proxy for NemoClaw agents

Note: Anthropic recently changed billing for third-party tools. See the integration guide for details.


For Tool Builders

nvHive is a routing layer. Any AI application can add multi-provider routing:

import asyncio

import nvh


async def main():
    # Drop-in OpenAI-compatible interface
    response = await nvh.complete([
        {"role": "user", "content": "Explain quicksort"}
    ])

    # Inspect routing without executing
    decision = await nvh.route("complex question about databases")

    # Council consensus
    result = await nvh.convene("Architecture review", cabinet="engineering")

    # Provider health check
    status = await nvh.health()


# The SDK calls are coroutines, so they need a running event loop.
asyncio.run(main())

SDK & API reference


Feature Matrix

Feature CLI Python SDK REST API MCP OpenClaw NemoClaw
Smart routing
Council consensus
Throwdown analysis
Confidence scoring
Escalation (--escalate)
Verification (--verify)
Local GPU inference
Adaptive learning
Provider health
Budget controls
Streaming
Privacy mode

Core Commands

Command | What It Does
nvh "question" | Smart route to best available model
nvh convene "question" | Council consensus (3+ models)
nvh throwdown "question" | Two-pass deep analysis with critique
nvh safe "question" | Local only; nothing leaves your machine
nvh ask --escalate | Try free first, escalate if uncertain
nvh ask --verify | Cross-model verification
nvh health | Provider resilience dashboard
nvh why | Explain last routing decision
nvh history | Recent queries with costs and timing
nvh bench | GPU speed test (tokens/sec)
nvh bench -q | Speed + quality comparison
nvh routing-stats | Learned vs static routing scores
nvh nvidia | NVIDIA GPU infrastructure status
nvh migrate | Import keys from OpenClaw / Claude Desktop
nvh setup | Interactive provider setup

Full command reference (50+ commands)

Providers

23 providers. 63 models. 25 free — no credit card required.

Tier | Providers | Rate Limits
Free (no signup) | Ollama (local), LLM7 | Unlimited / 30 RPM
Free (email signup) | Groq, GitHub Models, Cerebras, SambaNova, Cohere, AI21, SiliconFlow, HuggingFace | 15-30 RPM
Free (account) | Google Gemini, Mistral, NVIDIA NIM | 15-1000 RPM
Paid | OpenAI, Anthropic, DeepSeek, Fireworks, Together, OpenRouter, Grok | Pay per token

Verify It Yourself

nvh bench              # GPU speed (tokens/sec)
nvh bench -q           # speed + quality comparison
nvh health             # provider resilience
nvh why                # explain last routing decision
nvh routing-stats      # learning in action
nvh history            # recent queries with costs

16 quality prompts across code generation, debugging, reasoning, math, creative writing, and Q&A. Blind LLM judge. Run it yourself. Publish the results.

Learn More

Guide | Description
Getting Started | First-time setup
Commands | Full CLI reference (50+ commands)
Providers | 23 providers, rate limits, free tiers
Council System | Multi-LLM consensus with confidence scoring
OpenClaw Integration | Works alongside OpenClaw
Claude Code | MCP server setup
GPU Detection | Auto-detection, model selection, OOM protection
NemoClaw | NVIDIA NemoClaw integration
SDK & API | Python SDK, REST API, proxies
Architecture | System design and adaptive learning

License

MIT License. See LICENSE for details.

This version

0.5.3

Download files

Source Distribution

nvhive-0.5.3.tar.gz (357.2 kB view details)

Uploaded Source

Built Distribution

nvhive-0.5.3-py3-none-any.whl (391.6 kB view details)

Uploaded Python 3

File details

Details for the file nvhive-0.5.3.tar.gz.

File metadata

  • Download URL: nvhive-0.5.3.tar.gz
  • Upload date:
  • Size: 357.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for nvhive-0.5.3.tar.gz
Algorithm | Hash digest
SHA256 | 6543cee094390ae24958183cf87da39611b559a9740cbdff539f77a26ea97669
MD5 | 390034c37feac48190c9af6b36b6d8d2
BLAKE2b-256 | 0bc104a3bba43b6bbd07293b270984dd34b67ec12d4932ec8d75e5e3d05f15e3

File details

Details for the file nvhive-0.5.3-py3-none-any.whl.

File metadata

  • Download URL: nvhive-0.5.3-py3-none-any.whl
  • Upload date:
  • Size: 391.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for nvhive-0.5.3-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 5e94994ece6700764071b74e41e9a272da4d1fa45a9120bb0de72951a5b20353
MD5 | d96bfbf07069656705a1e3c9d1890a02
BLAKE2b-256 | ebd2cc34e3d34b004d416b5df380757af8679ad63ed3ae2edfecad95ddd9af2b
