Run Claude Code (and any Anthropic SDK client) on NVIDIA NIM models via a local proxy.
nvd-claude-proxy
Run Claude Code — and any Anthropic SDK client — on enterprise-grade NVIDIA NIM models.
nvd-claude-proxy is a low-latency local HTTP proxy that translates between the Anthropic Messages API and the NVIDIA NIM (OpenAI-compatible) API. The default runtime now uses the lightweight R2 path optimized for Claude Code responsiveness.
🚀 Key Features
- Architectural Excellence: Fully decoupled core translation logic from the transport layer.
- Enterprise Resilience: Built-in Circuit Breakers and automated failover chains to protect against upstream outages.
- Idempotency Support: Request deduplication and safe retries via `anthropic-idempotency-key` across Redis, SQLite, and Memory backends.
- Scalable State: Distributed session management via Redis (with SQLite and In-Memory fallbacks).
- Official-Grade Security: Unified `AuthMiddleware` protecting all endpoints with global API key enforcement.
- Claude Code Optimized: Specifically tuned for Claude Code's complex tool-calling and reasoning patterns.
- Vision & Progressive Streaming: Fine-grained progressive tool streaming and real-time multimodal (`image_url`) parity.
- Modular Pipeline: Event-driven streaming architecture for deterministic state management.
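The idempotency feature listed above can be pictured as a cache keyed by the `anthropic-idempotency-key` header value. The following is a minimal in-memory sketch of that idea only; `IdempotencyCache`, `handle`, and `fake_upstream` are hypothetical names, not the proxy's API, and a real backend would add expiry and Redis or SQLite storage.

```python
from typing import Callable, Dict, Optional


class IdempotencyCache:
    """Hypothetical in-memory store: one cached response per idempotency key."""

    def __init__(self) -> None:
        self._responses: Dict[str, dict] = {}

    def handle(self, key: Optional[str], upstream: Callable[[], dict]) -> dict:
        # Requests without a key are never deduplicated.
        if key is None:
            return upstream()
        # A repeated key returns the stored response instead of calling upstream again.
        if key not in self._responses:
            self._responses[key] = upstream()
        return self._responses[key]


calls = []


def fake_upstream() -> dict:
    """Stand-in for the real upstream call; counts how often it runs."""
    calls.append(1)
    return {"id": f"msg_{len(calls)}"}


cache = IdempotencyCache()
first = cache.handle("retry-123", fake_upstream)
second = cache.handle("retry-123", fake_upstream)  # safe retry, no second upstream call
```

Both calls share one key, so `fake_upstream` runs once and the retry receives the stored response.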
🛠 Deployment & Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `NVIDIA_API_KEY` | (Required) | Your NVIDIA NIM API key. |
| `PROXY_API_KEY` | None | Optional key to protect the proxy itself. |
| `STORAGE_ENGINE` | `sqlite` | Persistence backend: `redis`, `sqlite`, or `memory`. |
| `REDIS_URL` | None | Required if `STORAGE_ENGINE=redis` (e.g., `redis://localhost:6379`). |
| `PROXY_PORT` | `8788` | Local port for the proxy. |
| `RATE_LIMIT_RPM` | `0` | Global rate limit (requests per minute). 0 to disable. |
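The rules in the table (one required key, documented defaults, and `REDIS_URL` being mandatory only for the Redis backend) can be sketched as a small loader. This is an illustration of the documented behavior, not the proxy's own settings code; `ProxySettings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProxySettings:
    nvidia_api_key: str
    proxy_api_key: Optional[str]
    storage_engine: str
    redis_url: Optional[str]
    proxy_port: int
    rate_limit_rpm: int


def load_settings(env: Mapping[str, str] = os.environ) -> ProxySettings:
    """Illustrative loader mirroring the documented variables and defaults."""
    if not env.get("NVIDIA_API_KEY"):
        raise ValueError("NVIDIA_API_KEY is required")
    engine = env.get("STORAGE_ENGINE", "sqlite")
    if engine not in {"redis", "sqlite", "memory"}:
        raise ValueError(f"unknown STORAGE_ENGINE: {engine}")
    redis_url = env.get("REDIS_URL")
    if engine == "redis" and not redis_url:
        raise ValueError("REDIS_URL is required when STORAGE_ENGINE=redis")
    return ProxySettings(
        nvidia_api_key=env["NVIDIA_API_KEY"],
        proxy_api_key=env.get("PROXY_API_KEY"),
        storage_engine=engine,
        redis_url=redis_url,
        proxy_port=int(env.get("PROXY_PORT", "8788")),
        rate_limit_rpm=int(env.get("RATE_LIMIT_RPM", "0")),
    )


# Only the required key is set, so every other field falls back to its default.
settings = load_settings({"NVIDIA_API_KEY": "nvapi-example"})
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the real environment.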
Quick Start
# Install the proxy
pip install "nvd-claude-proxy[full]"
# Export your API key
export NVIDIA_API_KEY=nvapi-...
# Start the default low-latency runtime and launch Claude Code
ncp code
Then point Claude Code at the proxy:
export ANTHROPIC_BASE_URL=http://localhost:8788
claude
🏗 Architecture
The proxy uses a Chain of Responsibility pattern for streaming events:
MetadataProcessor -> TextProcessor -> ToolProcessor -> SafetyProcessor -> FinalizerProcessor
This ensures that even complex interleaved reasoning and parallel tool calls are correctly reconstructed for the Anthropic SDK.
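To make the pattern concrete, here is a minimal Chain of Responsibility sketch that reuses the five processor names from the pipeline above. Only the names and their order come from this document; the `Processor` base class, the `process` and `apply` methods, and what each link does to an event are assumptions made for illustration.

```python
from typing import List, Optional


class Processor:
    """Base link: do local work, then hand the event to the next link."""

    def __init__(self, nxt: Optional["Processor"] = None) -> None:
        self.nxt = nxt

    def process(self, event: dict, trace: List[str]) -> dict:
        trace.append(type(self).__name__)  # record the order of visits
        event = self.apply(event)
        return self.nxt.process(event, trace) if self.nxt else event

    def apply(self, event: dict) -> dict:
        return event  # default: pass the event through unchanged


class MetadataProcessor(Processor):
    def apply(self, event: dict) -> dict:
        # Assumed behavior: attach a message id when the event has none.
        return {"message_id": "msg_demo", **event}


class TextProcessor(Processor):
    pass


class ToolProcessor(Processor):
    pass


class SafetyProcessor(Processor):
    pass


class FinalizerProcessor(Processor):
    def apply(self, event: dict) -> dict:
        # Assumed behavior: mark the event as fully processed.
        return {**event, "finalized": True}


def build_chain() -> Processor:
    return MetadataProcessor(
        TextProcessor(ToolProcessor(SafetyProcessor(FinalizerProcessor())))
    )


trace: List[str] = []
result = build_chain().process({"type": "text", "text": "hi"}, trace)
```

Each link sees every event exactly once and in a fixed order, which is what makes the resulting stream state deterministic.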
Official-Grade Infrastructure for the AI Era.
Production Claude Code + NVIDIA NIM configuration
Use this proxy as the Anthropic-compatible endpoint for Claude Code:
export NVIDIA_API_KEY=nvapi-...
export PROXY_PORT=8788
export MAX_REQUEST_BODY_MB=32
export REQUEST_TIMEOUT_SECONDS=600
export STORAGE_ENGINE=redis
export REDIS_URL=redis://127.0.0.1:6379
# Optional but strongly recommended for shared/devbox usage
export PROXY_API_KEY=replace-with-a-long-random-secret
Run the proxy:
uv run ncp run
# or: ncp run
Point Claude Code at the proxy:
export ANTHROPIC_BASE_URL=http://127.0.0.1:8788
export ANTHROPIC_AUTH_TOKEN=dummy
claude
Recommended production notes
- Prefer `STORAGE_ENGINE=redis` for stable rate limiting, idempotency, and multi-session behavior.
- Keep `MAX_REQUEST_BODY_MB=32` to avoid pathological payloads while still supporting large Claude Code tool catalogs.
- Use the default streaming path; it emits early `message_start` and periodic `ping` events to reduce apparent latency and prevent idle timeouts.
- If tool calls appear slow or malformed upstream, start with `claude-sonnet-4-6` or `claude-haiku-4-5` mappings before moving to larger reasoning models.
- This proxy is translation-only: Claude Code executes tools locally; the proxy must preserve tool ordering, streamed JSON fragments, and Anthropic-compatible SSE grammar.
R2 low-latency mode
Version 1.3.5 adds a lightweight hosted-catalog runtime inspired by the one-file reference proxy. Use it when you care more about fast first-token latency and minimal overhead than about the full production registry/session stack.
Start R2 mode
ncp r2 --model nvidia/llama-3.3-nemotron-super-49b-v1.5
# or
nvd-claude-proxy-r2
Then point Claude Code at it:
M=nvidia/llama-3.3-nemotron-super-49b-v1.5
export ANTHROPIC_BASE_URL=http://127.0.0.1:8787
export ANTHROPIC_API_KEY=not-used
export ANTHROPIC_CUSTOM_MODEL_OPTION=$M
export ANTHROPIC_DEFAULT_HAIKU_MODEL=$M
export ANTHROPIC_DEFAULT_OPUS_MODEL=$M
export ANTHROPIC_DEFAULT_SONNET_MODEL=$M
export CLAUDE_CODE_SUBAGENT_MODEL=$M
claude
Why use R2 mode
- eager `message_start` for lower perceived TTFT
- 15s ping heartbeat during silent reasoning phases
- simpler tool translation path
- direct NVIDIA model IDs, no alias registry required
- less overhead than the full production runtime
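The first two points, an eager `message_start` and a heartbeat during silent phases, can be sketched with `asyncio`. This is one possible way to implement the idea, not the R2 runtime's code; `with_heartbeat`, `_next_or_done`, and `slow_upstream` are hypothetical names, and the demo uses a short interval instead of 15 seconds so it finishes quickly.

```python
import asyncio
from typing import AsyncIterator, List

_DONE = object()  # sentinel marking the end of the upstream stream


async def _next_or_done(it: AsyncIterator[dict]) -> object:
    try:
        return await it.__anext__()
    except StopAsyncIteration:
        return _DONE


async def with_heartbeat(
    upstream: AsyncIterator[dict], interval: float = 15.0
) -> AsyncIterator[dict]:
    # Emit message_start before the upstream has produced anything.
    yield {"type": "message_start"}
    it = upstream.__aiter__()
    pending = asyncio.ensure_future(_next_or_done(it))
    try:
        while True:
            # Wait for the next upstream event, but no longer than one interval.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield {"type": "ping"}  # upstream is silent: keep the connection alive
                continue
            event = pending.result()
            if event is _DONE:
                break
            pending = asyncio.ensure_future(_next_or_done(it))
            yield event
    finally:
        pending.cancel()  # no-op if the task already finished


async def slow_upstream() -> AsyncIterator[dict]:
    await asyncio.sleep(0.2)  # simulated silent reasoning phase
    yield {"type": "content_block_delta"}


async def collect() -> List[str]:
    return [e["type"] async for e in with_heartbeat(slow_upstream(), interval=0.05)]


event_types = asyncio.run(collect())
```

The pending read is kept in a task and re-awaited after each ping, so a timeout never cancels the upstream read itself.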
Default runtime in 1.4.0
Starting with 1.4.0, the default commands now use the low-latency R2 runtime:
- `ncp code` → starts the R2 runtime and launches Claude Code
- `ncp proxy` → starts the R2 runtime only
- `ncp r2` → explicit alias for the same default runtime
- `nvd-claude-proxy` → starts the R2 runtime when invoked as the package entrypoint
This change prioritizes:
- faster first-token latency
- simpler Claude Code model wiring
- lower runtime overhead
- direct NVIDIA model IDs
Use NCP_DEFAULT_MODEL to override the default hosted NVIDIA model used by ncp code and ncp proxy.
Streaming quality and visualization
The default runtime now emphasizes Anthropic-style streaming quality:
- SSE `id:` field is emitted on every event
- early `message_start` for lower perceived TTFT
- keepalive `ping` events during silent upstream gaps
- progressive `message_delta` usage snapshots after content-block closes
- visualization side-channel events via `event: ncp_visualization`
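For readers unfamiliar with the wire format, the sketch below serializes events as Server-Sent Events frames with an `id:` field on each one. The sequential numbering is an assumption; this document only states that an id is emitted on every event. `to_sse` is a hypothetical helper name.

```python
import json
from itertools import count
from typing import Iterable, Iterator


def to_sse(events: Iterable[dict]) -> Iterator[str]:
    """Serialize events as SSE frames, adding an increasing id: field to each."""
    ids = count(1)  # assumed id scheme: 1, 2, 3, ...
    for ev in events:
        # One frame is: event line, id line, data line, then a blank line.
        yield f"event: {ev['type']}\nid: {next(ids)}\ndata: {json.dumps(ev)}\n\n"


frames = list(to_sse([{"type": "message_start"}, {"type": "ping"}]))
```

Each frame ends with a blank line, which is how SSE clients detect the end of an event.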
R2 streaming environment knobs
- `R2_PING_INTERVAL`: keepalive cadence in seconds
- `R2_TEXT_DELTA_CHARS`: max chunk size for text/thinking deltas
- `R2_STREAM_VISUALIZATION`: enable or disable visualization side-channel events
- `R2_MESSAGE_DELTA_EVERY_BLOCK`: emit progress usage snapshots after each content block stop
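As an illustration of what a chunk-size knob such as `R2_TEXT_DELTA_CHARS` controls, the sketch below splits one upstream text payload into several `text_delta` events of bounded size. `split_text_delta` is a hypothetical helper, and passing the limit as a parameter rather than reading the environment is a simplification.

```python
from typing import Iterator


def split_text_delta(text: str, max_chars: int, index: int = 0) -> Iterator[dict]:
    """Yield Anthropic-style text_delta events of at most max_chars characters each."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    for start in range(0, len(text), max_chars):
        yield {
            "type": "content_block_delta",
            "index": index,
            "delta": {"type": "text_delta", "text": text[start:start + max_chars]},
        }


deltas = list(split_text_delta("hello world", 4))
pieces = [d["delta"]["text"] for d in deltas]
```

Smaller chunks give a smoother stream at the cost of more events.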
Visualization endpoint
The runtime also exposes:
GET /v1/stream/visualization
This reports the currently active visualization behavior for dashboards or debugging tools.
Stream dashboard
The low-latency runtime now ships with a beautiful live stream visualization UI.
Open:
/dashboard/stream
Features:
- glassmorphism dark UI
- live color-coded event timeline
- state graph lanes for lifecycle, content, tools, and diagnostics
- websocket-driven real-time visualization from the R2 stream side-channel
- usage progress counters and live request tracking
This UI is powered by the ncp_visualization side-channel and the websocket endpoint:
/ws/stream-visualization
Default max tokens
The default R2 runtime now supports a built-in default output budget for upstream requests when the client does not explicitly send max_tokens.
Use either:
export NCP_DEFAULT_MAX_TOKENS=12000
or per launch:
ncp code --max-tokens 12000
This is especially useful for large codebase mapping tasks where Claude Code may otherwise request too much output for the selected model context window.
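The behavior described above, applying a default only when the client sends no `max_tokens`, can be sketched as follows. `apply_default_max_tokens` is a hypothetical function name, and taking the environment as a parameter is a convenience for testing, not a description of the runtime.

```python
import os
from typing import Mapping


def apply_default_max_tokens(body: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Return the request body, adding max_tokens from NCP_DEFAULT_MAX_TOKENS if absent."""
    if "max_tokens" in body:
        return body  # an explicit client value always wins
    default = env.get("NCP_DEFAULT_MAX_TOKENS")
    if default is None:
        return body  # no default configured: leave the request untouched
    return {**body, "max_tokens": int(default)}


env = {"NCP_DEFAULT_MAX_TOKENS": "12000"}
without_budget = apply_default_max_tokens({"model": "m"}, env)
with_budget = apply_default_max_tokens({"model": "m", "max_tokens": 256}, env)
```

The function returns a new dictionary instead of mutating the request, so the original body stays available for logging or retries.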
Automatic fallback and context-safe retries
Before publishing, the default R2 runtime was further hardened to reduce Claude retry loops:
- automatic fallback across `NCP_FALLBACK_MODELS` when the primary model is retired, missing, rate-limited, or transiently failing
- automatic max-token reduction retry when NVIDIA returns context-length overflow style 400s
- startup diagnostics now print the dashboard and health URLs immediately
Override fallback models with:
export NCP_FALLBACK_MODELS="meta/llama-4-maverick-17b-128e-instruct,deepseek-ai/deepseek-v4-flash,qwen/qwen3-coder-480b-a35b-instruct"
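The two retry behaviors can be combined in one loop: halve the output budget on a context-overflow 400, and move to the next model on other retryable failures. The sketch below is a simplified reading of that description, not the runtime's logic. `UpstreamError`, `call_with_fallback`, the halving strategy, the `min_tokens` floor, and the set of status codes treated as retryable are all assumptions.

```python
from typing import Callable, List


class UpstreamError(Exception):
    """Hypothetical error type carrying the upstream HTTP status."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(message)
        self.status = status
        self.message = message


def parse_fallback_models(value: str) -> List[str]:
    """Split a comma-separated NCP_FALLBACK_MODELS value into model IDs."""
    return [m.strip() for m in value.split(",") if m.strip()]


def call_with_fallback(
    send: Callable[[str, int], dict],
    primary: str,
    fallbacks: List[str],
    max_tokens: int,
    min_tokens: int = 256,
) -> dict:
    last_error = None
    for model in [primary, *fallbacks]:
        budget = max_tokens
        while True:
            try:
                return send(model, budget)
            except UpstreamError as err:
                last_error = err
                overflow = err.status == 400 and "context" in err.message.lower()
                if overflow and budget // 2 >= min_tokens:
                    budget //= 2  # context-safe retry on the same model
                    continue
                retryable = overflow or err.status in {404, 410, 429} or err.status >= 500
                if not retryable:
                    raise
                break  # give up on this model and try the next one
    raise RuntimeError("all models failed") from last_error


attempts = []


def fake_send(model: str, max_tokens: int) -> dict:
    """Stand-in upstream: the primary is missing, others overflow above 4000 tokens."""
    attempts.append((model, max_tokens))
    if model == "primary/model":
        raise UpstreamError(404, "model not found")
    if max_tokens > 4000:
        raise UpstreamError(400, "maximum context length exceeded")
    return {"model": model, "max_tokens": max_tokens}


models = parse_fallback_models(
    "meta/llama-4-maverick-17b-128e-instruct, qwen/qwen3-coder-480b-a35b-instruct"
)
reply = call_with_fallback(fake_send, "primary/model", models, max_tokens=12000)
```

In this run the primary fails once, and the first fallback succeeds after its budget is halved from 12000 to 6000 and then to 3000.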
1.4.1 stability upgrade
Version 1.4.1 adds:
- `NCP_DEFAULT_MAX_TOKENS` and `ncp code --max-tokens`
- `NCP_FALLBACK_MODELS` automatic model fallback
- context-safe retry when upstream rejects oversized context windows
- improved startup diagnostics for dashboard and health endpoints