NIM Model Router

OpenAI-compatible proxy that routes requests to the best NVIDIA NIM model by task.

NVIDIA's NIM catalog has 100+ models. Picking the right one for each request is tedious. This router sits in front of the NIM API and automatically selects a model based on what you're asking for — fast chat, agentic tool use, deep reasoning, coding, embeddings, reranking, and more.

Drop it into any OpenAI SDK client by changing base_url. No other code changes required.

Quick start

git clone https://github.com/cobusgreyling/nim-model-router.git
cd nim-model-router
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cp .env.example .env
# Edit .env and set NVIDIA_API_KEY

nim-router serve

The proxy listens on http://127.0.0.1:8080. API docs: http://127.0.0.1:8080/docs

Docker

cp .env.example .env  # set NVIDIA_API_KEY
docker compose up --build

Usage

Auto-routing (recommended)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="local",  # not used — router injects NVIDIA_API_KEY upstream
)

response = client.chat.completions.create(
    model="nim-router/auto",
    messages=[{"role": "user", "content": "Build a Python agent with tool calling"}],
)
print(response.choices[0].message.content)

Explicit task aliases

| Model alias | Task | Default NIM model |
| --- | --- | --- |
| nim-router/auto | Classify automatically | |
| nim-router/fast | Short Q&A, classification | meta/llama-3.1-8b-instruct |
| nim-router/general | Ambiguous general chat | nvidia/nemotron-3-nano-30b-a3b |
| nim-router/agentic | Tool use, agents | nvidia/nemotron-3-super-120b-a12b |
| nim-router/reasoning | Deep analysis | nvidia/nemotron-3-ultra-550b-a55b |
| nim-router/long-context | Large documents | nvidia/nemotron-3-super-120b-a12b |
| nim-router/coding | Code generation | nvidia/llama-3.3-nemotron-super-49b-v1.5 |
| nim-router/embedding | Embeddings | nvidia/llama-nemotron-embed-1b-v2 |
| nim-router/rerank | Reranking | nvidia/llama-nemotron-rerank-1b-v2 |

Force a task via header

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-NIM-Task: reasoning" \
  -d '{
    "model": "nim-router/auto",
    "messages": [{"role": "user", "content": "Analyze the root cause step by step"}]
  }'
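
The same request can be assembled from Python's standard library. This is a minimal sketch that assumes the proxy runs on the default address from Quick start; `build_forced_task_request` is a hypothetical helper name used only for illustration, not part of the package.

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8080/v1/chat/completions"  # default address from Quick start


def build_forced_task_request(prompt: str, task: str) -> urllib.request.Request:
    """Build (but do not send) a chat request that forces a task via X-NIM-Task."""
    body = {
        "model": "nim-router/auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-NIM-Task": task},
        method="POST",
    )


request = build_forced_task_request("Analyze the root cause step by step", "reasoning")

# To send it against a running proxy:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes it easy to inspect the header and body before any network call is made.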

Rerank

curl http://127.0.0.1:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nim-router/rerank",
    "query": "What is NVIDIA NIM?",
    "documents": ["NIM provides optimized inference...", "Unrelated text"],
    "top_n": 2
  }'

Passthrough to a specific NIM model

If you pass a concrete NIM model ID (e.g. meta/llama-3.1-70b-instruct), the router forwards it unchanged.
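
The distinction between aliases and concrete model IDs can be sketched as a small resolver. This is an illustration, not the router's actual code: `resolve_model`, `KNOWN_TASKS`, and the choice to raise on an unknown `nim-router/*` alias are all assumptions.

```python
ALIAS_PREFIX = "nim-router/"

# Task aliases taken from the alias table; "auto" means "run the classifier".
KNOWN_TASKS = {
    "fast", "general", "agentic", "reasoning",
    "long-context", "coding", "embedding", "rerank",
}


def resolve_model(model: str) -> tuple[str, str]:
    """Return (kind, value), where kind is 'auto', 'task', or 'passthrough'."""
    if not model.startswith(ALIAS_PREFIX):
        # Concrete NIM model ID: forwarded unchanged.
        return ("passthrough", model)
    alias = model[len(ALIAS_PREFIX):]
    if alias == "auto":
        return ("auto", "")
    if alias in KNOWN_TASKS:
        return ("task", alias)
    # Assumption: an unrecognised alias is an error rather than a passthrough.
    raise ValueError(f"unknown router alias: {model!r}")


print(resolve_model("nim-router/coding"))
print(resolve_model("meta/llama-3.1-70b-instruct"))
```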

How routing works

flowchart TD
    A[Client request] --> B{Model alias?}
    B -->|nim-router/*| C[Resolve task]
    B -->|concrete NIM ID| D[Passthrough]
    B -->|nim-router/auto| E[Classifier]
    A --> F{X-NIM-Task header?}
    F -->|set| C
    E --> G{Signals}
    G -->|tools| H[agentic]
    G -->|large prompt| I[long_context]
    G -->|keywords| J[coding / reasoning / rerank]
    G -->|short prompt| K[fast]
    G -->|ambiguous| L[general]
    C --> M[Apply policies]
    M --> N[Pick model + fallbacks]
    N --> O[Proxy to NIM API]
    O -->|5xx| P[Try fallback model]

Classifier signals:

  • Tools present → agentic
  • Large prompt (>12k estimated tokens, tiktoken) → long_context
  • Reasoning keywords → reasoning
  • Coding keywords → coding
  • Rerank keywords / query+documents → rerank
  • Short prompt (≤120 chars) → fast
  • Ambiguous → general (not ultra-expensive agentic)

Policies can downgrade ultra models for short prompts and route low-confidence requests to general.
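
The signal list above can be read as a chain of checks. The sketch below is one possible interpretation, with several assumptions: the signals are evaluated in the order listed, the keyword tuples are hypothetical placeholders, and `estimate_tokens` uses a rough four-characters-per-token rule instead of tiktoken so the example stays dependency-free. Policies and confidence scoring are left out.

```python
from dataclasses import dataclass

LONG_CONTEXT_TOKENS = 12_000  # ">12k estimated tokens" from the signal list
SHORT_PROMPT_CHARS = 120      # "≤120 chars" from the signal list

# Hypothetical keyword lists, for illustration only.
REASONING_KEYWORDS = ("step by step", "root cause", "analyze", "prove")
CODING_KEYWORDS = ("python", "rust", "refactor", "debug", "function")
RERANK_KEYWORDS = ("rerank", "rank these")


@dataclass
class Route:
    task: str
    reason: str


def estimate_tokens(text: str) -> int:
    """Rough stand-in for tiktoken: about four characters per token."""
    return len(text) // 4


def classify(prompt: str, has_tools: bool = False, has_documents: bool = False) -> Route:
    """Apply the signals in the listed order; the first match wins (assumed)."""
    text = prompt.lower()
    if has_tools:
        return Route("agentic", "request includes tool definitions")
    if estimate_tokens(prompt) > LONG_CONTEXT_TOKENS:
        return Route("long_context", "large prompt")
    if any(k in text for k in REASONING_KEYWORDS):
        return Route("reasoning", "reasoning keywords")
    if any(k in text for k in CODING_KEYWORDS):
        return Route("coding", "coding keywords")
    if has_documents or any(k in text for k in RERANK_KEYWORDS):
        return Route("rerank", "rerank keywords or query+documents")
    if len(prompt) <= SHORT_PROMPT_CHARS:
        return Route("fast", "short prompt")
    return Route("general", "ambiguous")


print(classify("debug my rust code").task)
print(classify("plan a multi-step agent", has_tools=True).task)
```

To see what the router itself decides for a given prompt, use `nim-router route` or the dry-run endpoint rather than this sketch.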

CLI

# Start proxy
nim-router serve --port 8080 --config src/nim_model_router/models.yaml

# Dry-run routing (no API call)
nim-router route "refactor this Python function" --json
nim-router route "hello"
nim-router route "plan a multi-step agent" --tools

# Show registry
nim-router models

# Sync model suggestions from NIM catalog
nim-router catalog-sync --task coding

# Print OpenAI SDK example
nim-router client-example

Observability

Every proxied response includes routing metadata:

| Header | Example |
| --- | --- |
| X-NIM-Routed-Task | agentic |
| X-NIM-Routed-Model | nvidia/nemotron-3-super-120b-a12b |
| X-NIM-Router-Reason | request includes tool definitions |
| X-NIM-Router-Confidence | 0.950 |

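
On the client side, these headers can be collected into a plain dictionary. The helper below is a sketch; `routing_metadata` is a hypothetical name, and it assumes only the four headers listed above.

```python
from typing import Mapping

ROUTING_HEADERS = {
    "task": "X-NIM-Routed-Task",
    "model": "X-NIM-Routed-Model",
    "reason": "X-NIM-Router-Reason",
    "confidence": "X-NIM-Router-Confidence",
}


def routing_metadata(headers: Mapping[str, str]) -> dict:
    """Collect the router's headers into a dict; missing headers become None."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    meta = {key: lowered.get(name.lower()) for key, name in ROUTING_HEADERS.items()}
    if meta["confidence"] is not None:
        meta["confidence"] = float(meta["confidence"])
    return meta


example = routing_metadata({
    "X-NIM-Routed-Task": "agentic",
    "X-NIM-Routed-Model": "nvidia/nemotron-3-super-120b-a12b",
    "X-NIM-Router-Reason": "request includes tool definitions",
    "X-NIM-Router-Confidence": "0.950",
})
print(example["task"], example["confidence"])
```

With the OpenAI Python SDK, response headers are available through `client.chat.completions.with_raw_response.create(...)`: the returned object exposes `.headers`, and `.parse()` yields the usual completion.
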
# Live stats
curl http://127.0.0.1:8080/v1/router/stats

# Task registry + cost summary
curl http://127.0.0.1:8080/v1/router/tasks

# Reload config without restart
curl -X POST http://127.0.0.1:8080/v1/router/reload

# Prometheus metrics
curl http://127.0.0.1:8080/metrics

# Dry-run endpoint
curl -X POST http://127.0.0.1:8080/v1/router/dry-run \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"debug my rust code"}]}'

Set ROUTER_LOG_PATH=data/router.log.jsonl to persist request logs.

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| NVIDIA_API_KEY | | Required upstream API key |
| NIM_BASE_URL | https://integrate.api.nvidia.com/v1 | NIM API base URL |
| ROUTER_HOST | 127.0.0.1 | Proxy bind host |
| ROUTER_PORT | 8080 | Proxy bind port |
| ROUTER_CONFIG | bundled models.yaml | Custom registry path |
| ROUTER_LOG_PATH | | JSONL log file path |
| ROUTER_API_KEY | | Optional client auth key |
| UPSTREAM_MAX_RETRIES | 3 | Retries for 429/5xx |
| UPSTREAM_RETRY_BACKOFF_SECONDS | 0.5 | Retry backoff base |
| ENABLE_PROMETHEUS | true | Expose /metrics |
| HEALTH_CHECK_UPSTREAM | false | Include upstream status in /health |
| MAX_REQUEST_BODY_BYTES | 10485760 | Max request body size |
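
To make the two retry variables concrete, here is a sketch of how they might combine. The exponential schedule (base × 2^attempt) is an assumption, since the table only describes the value as a "backoff base"; `RetrySettings`, `load_retry_settings`, `backoff_delays`, and `should_retry` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrySettings:
    max_retries: int
    backoff_base: float


def load_retry_settings(env=os.environ) -> RetrySettings:
    """Read the two retry variables, falling back to the documented defaults."""
    return RetrySettings(
        max_retries=int(env.get("UPSTREAM_MAX_RETRIES", "3")),
        backoff_base=float(env.get("UPSTREAM_RETRY_BACKOFF_SECONDS", "0.5")),
    )


def backoff_delays(settings: RetrySettings) -> list[float]:
    """Assumed schedule: wait base * 2**attempt seconds before each retry."""
    return [settings.backoff_base * (2 ** attempt) for attempt in range(settings.max_retries)]


def should_retry(status_code: int) -> bool:
    """429 and 5xx are the retryable statuses named in the table."""
    return status_code == 429 or 500 <= status_code <= 599


defaults = load_retry_settings({})  # empty mapping, so the defaults apply
print(backoff_delays(defaults))
```

Under this assumed schedule, the defaults give waits of 0.5, 1.0, and 2.0 seconds before the first, second, and third retry.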

Customizing models

Edit src/nim_model_router/models.yaml (or set ROUTER_CONFIG):

tasks:
  agentic:
    model: nvidia/nemotron-3-nano-30b-a3b
    fallbacks:
      - general
      - fast
    extra_body:
      enable_thinking: true
      reasoning_budget: 2048
    ab_test:
      enabled: false
      variants:
        - model: nvidia/nemotron-3-nano-30b-a3b
          weight: 50
        - model: nvidia/nemotron-3-super-120b-a12b
          weight: 50

classifier:
  use_llm_classifier: false  # set true + pip install ".[llm-classifier]"

Reload at runtime: curl -X POST http://127.0.0.1:8080/v1/router/reload
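
One way to read the `fallbacks` and `ab_test` keys is sketched below, working on the configuration after it has been parsed into dictionaries. It rests on two assumptions: fallback entries are task names that resolve to that task's model, and A/B variants are drawn in proportion to `weight`. `pick_primary` and `candidate_models` are hypothetical names, and skipping duplicate models is a choice made for this example.

```python
import random

# Mirrors the YAML shape above. The "general" and "fast" models come from
# the alias table earlier in this README.
TASKS = {
    "agentic": {
        "model": "nvidia/nemotron-3-nano-30b-a3b",
        "fallbacks": ["general", "fast"],
        "ab_test": {
            "enabled": False,
            "variants": [
                {"model": "nvidia/nemotron-3-nano-30b-a3b", "weight": 50},
                {"model": "nvidia/nemotron-3-super-120b-a12b", "weight": 50},
            ],
        },
    },
    "general": {"model": "nvidia/nemotron-3-nano-30b-a3b"},
    "fast": {"model": "meta/llama-3.1-8b-instruct"},
}


def pick_primary(task: str, rng: random.Random) -> str:
    """Return the task's model, or a weighted variant when ab_test is enabled."""
    cfg = TASKS[task]
    ab = cfg.get("ab_test", {})
    if ab.get("enabled") and ab.get("variants"):
        models = [v["model"] for v in ab["variants"]]
        weights = [v["weight"] for v in ab["variants"]]
        return rng.choices(models, weights=weights, k=1)[0]
    return cfg["model"]


def candidate_models(task: str, rng: random.Random) -> list[str]:
    """Primary model first, then each fallback task's model, without duplicates."""
    ordered = [pick_primary(task, rng)]
    for fallback_task in TASKS[task].get("fallbacks", []):
        model = TASKS[fallback_task]["model"]
        if model not in ordered:
            ordered.append(model)
    return ordered


print(candidate_models("agentic", random.Random(0)))
```

In this example configuration `general` resolves to the same model as the primary, so it is skipped and only the `fast` model remains as a distinct fallback.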

Integrations

LiteLLM

model_list:
  - model_name: nim-auto
    litellm_params:
      model: openai/nim-router/auto
      api_base: http://127.0.0.1:8080/v1
      api_key: local

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="local",
    model="nim-router/auto",
)

Continue / Cursor (OpenAI-compatible)

{
  "models": [{
    "title": "NIM Auto",
    "provider": "openai",
    "model": "nim-router/auto",
    "apiBase": "http://127.0.0.1:8080/v1",
    "apiKey": "local"
  }]
}

Development

pip install -e ".[dev]"
pytest --cov=nim_model_router --cov-report=term-missing
ruff check src tests
ruff format src tests

See CONTRIBUTING.md.

Security

  • Store NVIDIA_API_KEY in .env — never commit it.
  • Set ROUTER_API_KEY before exposing the proxy beyond localhost.
  • The proxy binds to 127.0.0.1 by default. Put it behind Docker or reverse-proxy authentication for production.

License

MIT
