Run Claude Code (and any Anthropic SDK client) on NVIDIA NIM models via a local proxy.

These details have not been verified by PyPI

Project links

Project description

nvd-claude-proxy

Run Claude Code — and any Anthropic SDK client — on enterprise-grade NVIDIA NIM models.

nvd-claude-proxy is a low-latency local HTTP proxy that translates between the Anthropic Messages API and the NVIDIA NIM (OpenAI-compatible) API. The default runtime now uses the lightweight R2 path optimized for Claude Code responsiveness.

🚀 Key Features

Architectural Excellence: Fully decoupled core translation logic from the transport layer.
Enterprise Resilience: Built-in Circuit Breakers and automated failover chains to protect against upstream outages.
Idempotency Support: Request deduplication and safe retries via anthropic-idempotency-key across Redis, SQLite, and Memory backends.
Scalable State: Distributed session management via Redis (with SQLite and In-Memory fallbacks).
Official-Grade Security: Unified AuthMiddleware protecting all endpoints with global API key enforcement.
Claude Code Optimized: Specifically tuned for Claude Code's complex tool-calling and reasoning patterns.
Vision & Progressive Streaming: Fine-grained progressive tool streaming and real-time multimodal (image_url) parity.
Modular Pipeline: Event-driven streaming architecture for deterministic state management.

🛠 Deployment & Configuration

Environment Variables

Variable	Default	Description
`NVIDIA_API_KEY`	(Required)	Your NVIDIA NIM API key.
`PROXY_API_KEY`	None	Optional key to protect the proxy itself.
`STORAGE_ENGINE`	`sqlite`	Persistence backend: `redis`, `sqlite`, or `memory`.
`REDIS_URL`	None	Required if `STORAGE_ENGINE=redis` (e.g., `redis://localhost:6379`).
`PROXY_PORT`	`8788`	Local port for the proxy.
`RATE_LIMIT_RPM`	`0`	Global rate limit (requests per minute). `0` to disable.

Quick Start

# Install the proxy
pip install nvd-claude-proxy[full]

# Export your API key
export NVIDIA_API_KEY=nvapi-...

# Start the default low-latency runtime and launch Claude Code
ncp code

Then point your Claude Code at the proxy:

export ANTHROPIC_BASE_URL=http://localhost:8788
claude

🏗 Architecture

The proxy uses a Chain of Responsibility pattern for streaming events: MetadataProcessor -> TextProcessor -> ToolProcessor -> SafetyProcessor -> FinalizerProcessor

This ensures that even complex interleaved reasoning and parallel tool calls are correctly reconstructed for the Anthropic SDK.

Official-Grade Infrastructure for the AI Era.

Production Claude Code + NVIDIA NIM configuration

Use this proxy as the Anthropic-compatible endpoint for Claude Code:

export NVIDIA_API_KEY=nvapi-...
export PROXY_PORT=8788
export MAX_REQUEST_BODY_MB=32
export REQUEST_TIMEOUT_SECONDS=600
export STORAGE_ENGINE=redis
export REDIS_URL=redis://127.0.0.1:6379

# Optional but strongly recommended for shared/devbox usage
export PROXY_API_KEY=replace-with-a-long-random-secret

Run the proxy:

uv run ncp run
# or: ncp run

Point Claude Code at the proxy:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8788
export ANTHROPIC_AUTH_TOKEN=dummy
claude

Recommended production notes

Prefer STORAGE_ENGINE=redis for stable rate limiting, idempotency, and multi-session behavior.
Keep MAX_REQUEST_BODY_MB=32 to avoid pathological payloads while still supporting large Claude Code tool catalogs.
Use the default streaming path; it emits early message_start and periodic ping events to reduce apparent latency and prevent idle timeouts.
If tool calls appear slow or malformed upstream, start with claude-sonnet-4-6 or claude-haiku-4-5 mappings before moving to larger reasoning models.
This proxy is translation-only: Claude Code executes tools locally; the proxy must preserve tool ordering, streamed JSON fragments, and Anthropic-compatible SSE grammar.

R2 low-latency mode

Version 1.3.5 adds a lightweight hosted-catalog runtime inspired by the one-file reference proxy. Use it when you care more about fast first-token latency and minimal overhead than about the full production registry/session stack.

Start R2 mode

ncp r2 --model nvidia/llama-3.3-nemotron-super-49b-v1.5
# or
nvd-claude-proxy-r2

Then point Claude Code at it:

M=nvidia/llama-3.3-nemotron-super-49b-v1.5
export ANTHROPIC_BASE_URL=http://127.0.0.1:8787
export ANTHROPIC_API_KEY=not-used
export ANTHROPIC_CUSTOM_MODEL_OPTION=$M
export ANTHROPIC_DEFAULT_HAIKU_MODEL=$M
export ANTHROPIC_DEFAULT_OPUS_MODEL=$M
export ANTHROPIC_DEFAULT_SONNET_MODEL=$M
export CLAUDE_CODE_SUBAGENT_MODEL=$M
claude

Why use R2 mode

eager message_start for lower perceived TTFT
15s ping heartbeat during silent reasoning phases
simpler tool translation path
direct NVIDIA model IDs, no alias registry required
less overhead than the full production runtime

Default runtime in 1.4.0

Starting with 1.4.0, the default commands now use the low-latency R2 runtime:

ncp code → starts the R2 runtime and launches Claude Code
ncp proxy → starts the R2 runtime only
ncp r2 → explicit alias for the same default runtime
nvd-claude-proxy → starts the R2 runtime when invoked as the package entrypoint

This change prioritizes:

faster first-token latency
simpler Claude Code model wiring
lower runtime overhead
direct NVIDIA model IDs

Use NCP_DEFAULT_MODEL to override the default hosted NVIDIA model used by ncp code and ncp proxy.

Streaming quality and visualization

The default runtime now emphasizes Anthropic-style streaming quality:

SSE id: field is emitted on every event
early message_start for lower perceived TTFT
keepalive ping events during silent upstream gaps
progressive message_delta usage snapshots after content-block closes
visualization side-channel events via event: ncp_visualization

R2 streaming environment knobs

R2_PING_INTERVAL — keepalive cadence in seconds
R2_TEXT_DELTA_CHARS — max chunk size for text/thinking deltas
R2_STREAM_VISUALIZATION — enable or disable visualization side-channel events
R2_MESSAGE_DELTA_EVERY_BLOCK — emit progress usage snapshots after each content block stop

Visualization endpoint

The runtime also exposes:

GET /v1/stream/visualization

This reports the currently active visualization behavior for dashboards or debugging tools.

Stream dashboard

The low-latency runtime now ships with a beautiful live stream visualization UI.

Open:

/dashboard/stream

Features:

glassmorphism dark UI
live color-coded event timeline
state graph lanes for lifecycle, content, tools, and diagnostics
websocket-driven real-time visualization from the R2 stream side-channel
usage progress counters and live request tracking

This UI is powered by the ncp_visualization side-channel and the websocket endpoint:

/ws/stream-visualization

Default max tokens

The default R2 runtime now supports a built-in default output budget for upstream requests when the client does not explicitly send max_tokens.

Use either:

export NCP_DEFAULT_MAX_TOKENS=12000

or per launch:

ncp code --max-tokens 12000

This is especially useful for large codebase mapping tasks where Claude Code may otherwise request too much output for the selected model context window.

Automatic fallback and context-safe retries

Before publishing, the default R2 runtime was further hardened to reduce Claude retry loops:

automatic fallback across NCP_FALLBACK_MODELS when the primary model is retired, missing, rate-limited, or transiently failing
automatic max-token reduction retry when NVIDIA returns context-length overflow style 400s
startup diagnostics now print the dashboard and health URLs immediately

Override fallback models with:

export NCP_FALLBACK_MODELS="meta/llama-4-maverick-17b-128e-instruct,deepseek-ai/deepseek-v4-flash,qwen/qwen3-coder-480b-a35b-instruct"

1.4.1 stability upgrade

Version 1.4.1 adds:

NCP_DEFAULT_MAX_TOKENS and ncp code --max-tokens
NCP_FALLBACK_MODELS automatic model fallback
context-safe retry when upstream rejects oversized context windows
improved startup diagnostics for dashboard and health endpoints

1.4.2 classic R2 restore

Version 1.4.2 restores the smooth 1.3.5-style R2 hosted-catalog flow as the primary experience, while keeping non-invasive improvements like the stream dashboard and streaming observability.

Primary command

ncp code now uses the restored classic R2 path.

Recommended launch:

ncp code --model nvidia/llama-3.3-nemotron-super-49b-v1.5 --max-tokens 12000

Permanent fix for context overflow

The classic R2 runtime now applies permanent budget guardrails:

explicit client max_tokens is hard-clamped by --max-tokens / NCP_HARD_MAX_TOKENS
omitted client max_tokens uses NCP_DEFAULT_MAX_TOKENS
oversized input is rejected early with an actionable Claude-specific error
combined input/output budget is reduced before upstream request when possible

This permanently fixes the common failure mode where Claude Code asks for too much output or continues a giant session until the provider rejects it.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.4.5

May 10, 2026

1.4.4

May 10, 2026

1.4.3

May 10, 2026

This version

1.4.2

May 10, 2026

1.4.1

May 10, 2026

1.4.0

May 10, 2026

1.3.5

May 10, 2026

1.3.4

May 10, 2026

1.3.3

May 8, 2026

1.3.2

May 8, 2026

1.3.1

May 7, 2026

1.3.0

May 7, 2026

1.2.0

May 7, 2026

1.1.8

May 5, 2026

1.1.7

May 5, 2026

1.1.6

May 5, 2026

1.1.5

May 4, 2026

1.1.4

May 4, 2026

1.1.3

May 4, 2026

1.1.2

May 4, 2026

1.1.0

May 3, 2026

1.0.8

May 2, 2026

1.0.7

May 2, 2026

1.0.6

May 1, 2026

1.0.5

Apr 29, 2026

1.0.4

Apr 29, 2026

1.0.3

Apr 27, 2026

1.0.1

Apr 25, 2026

1.0.0

Apr 25, 2026

0.9.3

Apr 25, 2026

0.9.2

Apr 25, 2026

0.9.1

Apr 25, 2026

0.9.0

Apr 25, 2026

0.8.9

Apr 25, 2026

0.8.8

Apr 23, 2026

0.8.7

Apr 23, 2026

0.8.6

Apr 23, 2026

0.8.5

Apr 23, 2026

0.8.4

Apr 23, 2026

0.8.3

Apr 23, 2026

0.8.2

Apr 23, 2026

0.8.1

Apr 23, 2026

0.8.0

Apr 23, 2026

0.7.2

Apr 23, 2026

0.7.1

Apr 23, 2026

0.7.0

Apr 23, 2026

0.6.3

Apr 22, 2026

0.6.2

Apr 22, 2026

0.5.8

Apr 22, 2026

0.5.7

Apr 22, 2026

0.5.6

Apr 22, 2026

0.5.5

Apr 22, 2026

0.5.4

Apr 22, 2026

0.5.3

Apr 22, 2026

0.5.2

Apr 22, 2026

0.5.1

Apr 22, 2026

0.5.0

Apr 22, 2026

0.4.5

Apr 22, 2026

0.4.4

Apr 21, 2026

0.4.3

Apr 21, 2026

0.4.2

Apr 21, 2026

0.4.1

Apr 21, 2026

0.4.0

Apr 21, 2026

0.3.9

Apr 21, 2026

0.3.8

Apr 21, 2026

0.3.7

Apr 21, 2026

0.3.6

Apr 20, 2026

0.3.5

Apr 20, 2026

0.3.4

Apr 20, 2026

0.3.3

Apr 20, 2026

0.3.2

Apr 20, 2026

0.3.1

Apr 20, 2026

0.3.0

Apr 20, 2026

0.2.9

Apr 20, 2026

0.2.8

Apr 20, 2026

0.2.7

Apr 20, 2026

0.2.6

Apr 20, 2026

0.2.5

Apr 20, 2026

0.2.4

Apr 20, 2026

0.2.3

Apr 20, 2026

0.2.2

Apr 20, 2026

0.2.0

Apr 20, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nvd_claude_proxy-1.4.2.tar.gz (139.3 kB view details)

Uploaded May 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

nvd_claude_proxy-1.4.2-py3-none-any.whl (164.4 kB view details)

Uploaded May 10, 2026 Python 3

File details

Details for the file nvd_claude_proxy-1.4.2.tar.gz.

File metadata

Download URL: nvd_claude_proxy-1.4.2.tar.gz
Upload date: May 10, 2026
Size: 139.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for nvd_claude_proxy-1.4.2.tar.gz
Algorithm	Hash digest
SHA256	`9e763da75dfda817ce1579ef5d240c79431cd90873aa4dd019b05dbbf4fda597`
MD5	`1f9b0dbce3dcc8666cf80a5e2bc2f60e`
BLAKE2b-256	`6290315218775af3cbac3bf74b87a5795e4716d7de2472c41dd4271bafa377b7`

See more details on using hashes here.

File details

Details for the file nvd_claude_proxy-1.4.2-py3-none-any.whl.

File metadata

Download URL: nvd_claude_proxy-1.4.2-py3-none-any.whl
Upload date: May 10, 2026
Size: 164.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for nvd_claude_proxy-1.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`02c475033bcca8d9bc55bf0dca727dfe6bc25a2654ac2bad6629e7961365b88a`
MD5	`162e11e1004d73d3dec217b775407250`
BLAKE2b-256	`dcc74374155d16c136ebca92ce7ba1f555749c0e3936e59e11620b929f848909`

See more details on using hashes here.

nvd-claude-proxy 1.4.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

nvd-claude-proxy

🚀 Key Features

🛠 Deployment & Configuration

Environment Variables

Quick Start

🏗 Architecture

Production Claude Code + NVIDIA NIM configuration

Recommended production notes

R2 low-latency mode

Start R2 mode

Why use R2 mode

Default runtime in 1.4.0

Streaming quality and visualization

R2 streaming environment knobs

Visualization endpoint

Stream dashboard

Default max tokens

Automatic fallback and context-safe retries

1.4.1 stability upgrade

1.4.2 classic R2 restore

Primary command

Permanent fix for context overflow

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes