llm-relay

Unified LLM usage management — API proxy, session diagnostics, multi-CLI orchestration.

Why

This project started from a need to escape deep vendor lock-in with a single AI coding tool. After investigating hidden behaviors in Claude Code — silent token inflation, false rate limits, context stripping, and opaque feature flags — it became clear that relying on one vendor's black box was a risk. llm-relay was built to take back visibility and control: monitor what's actually happening, diagnose problems independently, and orchestrate across multiple CLI tools (Claude Code, Codex, Gemini) so no single provider becomes a single point of failure.

Features

Proxy: Transparent API proxy with cache/token monitoring and 12-strategy pruning
Detect: 7 detectors (orphan, stuck, synthetic, bloat, cache, resume, microcompact)
Recover: Session recovery and doctor (7 health checks)
Guard: 4-tier threshold daemon with dual-zone classification
Cost: Per-1% cost calculation and rate-limit header analysis
Orch: Multi-CLI orchestration (Claude Code, Codex CLI, Gemini CLI)
Display: Multi-CLI session monitor with context composition pie chart, connection type badges (SSH/tmux/tailscale/mosh), and provider liveness detection
History: Proxy-level conversation capture with delta/full storage, compaction detection, and web replay viewer
Composition: Real-time context window analysis — classifies content into 6 categories (user/assistant/tool_use/tool_result/thinking/system) with SNR metrics and duplicate read tracking
Monitoring: Quota utilization (Q5h/Q7d), cache hit rate, error rate (2xx/4xx/5xx/429), TTL tier detection (1h/5m) — all surfaced from data already collected by the proxy
TUI: llm-relay top — btop-style terminal monitor with Rich Live (works over SSH, no browser needed)
i18n: Browser locale detection with en/ko support; server-side override via LLM_RELAY_LANG
MCP: 8 tools via stdio transport (cli_delegate, cli_status, cli_probe, orch_delegate, orch_history, relay_stats, session_turns, session_history)
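
To make the Composition feature concrete, here is a minimal sketch of how message content might be sorted into the six categories listed above. It is not llm-relay's implementation: the function names, the rule set in `classify_block`, and the use of character counts as a stand-in for tokens are all assumptions made for illustration. The block types (`text`, `tool_use`, `tool_result`, `thinking`) follow the Anthropic Messages API content-block types.

```python
from collections import Counter

# The six categories named in the feature list.
CATEGORIES = ("user", "assistant", "tool_use", "tool_result", "thinking", "system")


def classify_block(role: str, block: dict) -> str:
    """Map one content block to a category (hypothetical rule set)."""
    block_type = block.get("type", "text")
    if block_type in ("tool_use", "tool_result", "thinking"):
        return block_type
    # Plain text is attributed to whoever sent it; unknown roles fall back to system.
    return role if role in ("user", "assistant", "system") else "system"


def composition(messages: list) -> dict:
    """Return each category's share of total characters, as fractions of 1."""
    sizes = Counter()
    for message in messages:
        for block in message["content"]:
            # Length of the serialized block is a crude proxy for token count.
            sizes[classify_block(message["role"], block)] += len(str(block))
    total = sum(sizes.values()) or 1
    return {name: sizes[name] / total for name in CATEGORIES}
```

A real analyzer would count tokens rather than characters; the proportions are what a composition chart would display.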

Install

# CLI only (diagnostics, recovery, orchestration)
pip install llm-relay

# With Rich TUI (llm-relay top)
pip install "llm-relay[cli]"

# With proxy + web dashboard
pip install "llm-relay[proxy]"

# With MCP server (Python 3.10+)
pip install "llm-relay[mcp]"

# Everything
pip install "llm-relay[all]"

Quick Start

One-command setup (recommended)

pip install "llm-relay[all]"
llm-relay init

This single command:

Detects installed CLIs (Claude Code, Codex, Gemini)
Initializes the database (~/.llm-relay/usage.db)
Configures Claude Code to route through the proxy (ANTHROPIC_BASE_URL)
Registers the MCP server in Claude Code (8 tools)
Starts the proxy server with history enabled
Runs a health check to verify everything works

After init, open: http://localhost:8083/dashboard/

Options: --dry-run (preview without changes), --skip-server (configure only), --port 9090 (custom port).
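
The first init step, detecting installed CLIs, can be approximated by looking each tool up on PATH. The sketch below is an assumption about how such detection might work, not a description of what llm-relay init does; the executable names `claude`, `codex`, and `gemini` and the helper `detect_clis` are illustrative.

```python
import shutil
from typing import Callable, Optional

# Assumed executable names; the real installer may look elsewhere.
CLI_EXECUTABLES = {"Claude Code": "claude", "Codex": "codex", "Gemini": "gemini"}


def detect_clis(which: Callable[[str], Optional[str]] = shutil.which) -> dict:
    """Return {display name: resolved path or None} for each known CLI."""
    return {name: which(exe) for name, exe in CLI_EXECUTABLES.items()}
```

Calling `detect_clis()` with no arguments consults the real PATH; passing a custom `which` function makes the lookup easy to test without any CLI installed.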

Manual setup

# CLI diagnostics only (no server needed)
pip install llm-relay
llm-relay scan              # Session health check (7 detectors)
llm-relay doctor            # Configuration health check (7 checks)
llm-relay top               # Live terminal monitor (btop-style TUI)

# Web dashboard
pip install "llm-relay[proxy]"
llm-relay serve             # Starts proxy + dashboard on port 8083

# Then configure Claude Code to use the proxy:
# In ~/.claude/settings.json, add:
#   "env": { "ANTHROPIC_BASE_URL": "http://localhost:8083" }
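
If you prefer to script the settings change rather than edit the file by hand, the following sketch merges the `env` entry into an existing settings file without discarding other keys. The helper name `set_proxy_base_url` is hypothetical, and the sketch assumes the file, if present, contains a valid JSON object.

```python
import json
from pathlib import Path


def set_proxy_base_url(settings_path: Path, base_url: str) -> dict:
    """Merge ANTHROPIC_BASE_URL into the "env" object, keeping other keys."""
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    # Create "env" if it is missing, then set or overwrite the base URL.
    settings.setdefault("env", {})["ANTHROPIC_BASE_URL"] = base_url
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

For example, `set_proxy_base_url(Path.home() / ".claude" / "settings.json", "http://localhost:8083")` would apply the change described above. Back up the file first if it holds settings you care about.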

Web pages:

/dashboard/ — CLI status, cost, quota, error rate, cache hit rate, Turn Monitor
/display/ — Turn counter with context composition, connection type badges
/history/ — Session conversation replay with compaction timeline
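
The error rate and cache hit rate shown on the dashboard can be read as simple ratios over proxied traffic. The sketch below shows one plausible way to compute them; the exact formulas llm-relay uses are not documented here, so treat the bucketing (429 counted separately from other 4xx codes) and the denominator of the cache hit rate as assumptions. The usage field names follow the Anthropic Messages API response.

```python
from collections import Counter


def status_bucket(status: int) -> str:
    """Bucket an HTTP status code; 429 is reported separately from other 4xx."""
    if status == 429:
        return "429"
    if 200 <= status < 300:
        return "2xx"
    if 400 <= status < 500:
        return "4xx"
    if 500 <= status < 600:
        return "5xx"
    return "other"


def error_breakdown(statuses: list) -> dict:
    """Count responses per bucket, in the order the dashboard lists them."""
    counts = Counter(status_bucket(s) for s in statuses)
    return {bucket: counts[bucket] for bucket in ("2xx", "4xx", "5xx", "429")}


def cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache, in [0, 1]."""
    read = usage.get("cache_read_input_tokens", 0)
    total = read + usage.get("cache_creation_input_tokens", 0) + usage.get("input_tokens", 0)
    return read / total if total else 0.0
```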

MCP server

llm-relay-mcp               # stdio transport, 8 tools
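
For MCP clients that are configured through a JSON file with an `mcpServers` object, a stdio server entry can be generated as in the sketch below. The server name `llm-relay` and the helper `mcp_server_entry` are assumptions for illustration; what llm-relay init writes when it registers the server is not shown in this document.

```python
import json


def mcp_server_entry(name: str = "llm-relay", command: str = "llm-relay-mcp") -> dict:
    """Build an "mcpServers" config fragment for a stdio MCP server."""
    return {"mcpServers": {name: {"command": command, "args": []}}}


# Print the fragment so it can be pasted into a client's MCP config file.
print(json.dumps(mcp_server_entry(), indent=2))
```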

API Endpoints

All endpoints are served by the proxy at http://localhost:8083/api/v1/.

Endpoint	Description
`GET /api/v1/turns`	Turn counts + token metrics + zone classification for active sessions
`GET /api/v1/turns/{session_id}`	Per-session metrics with cache hit rate and TTL tier
`GET /api/v1/display`	Session cards with prompts, terminal info, composition
`GET /api/v1/quota`	Anthropic Q5h/Q7d quota utilization and overage status
`GET /api/v1/errors`	Error rate breakdown (2xx/4xx/5xx/429)
`GET /api/v1/cache`	Cache hit rate (global or per-session)
`GET /api/v1/ttl`	Cache TTL tier detection (1h/5m/mixed)
`GET /api/v1/health`	CLI + proxy + orchestration DB health
`GET /api/v1/cost`	Cost breakdown by model
`GET /api/v1/sessions`	Proxy session summaries
`GET /api/v1/cli/status`	CLI installation and auth status
`GET /api/v1/delegations`	Multi-CLI delegation history
`GET /api/v1/delegations/stats`	Delegation aggregate statistics
`GET /api/v1/history`	Sessions with conversation history
`GET /api/v1/history/{session_id}`	Conversation turns for a session
`GET /api/v1/history/{session_id}/compactions`	Compaction events
`GET /api/v1/history/{session_id}/composition`	Per-turn context composition
`GET /api/v1/i18n`	Locale-specific UI messages
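
A small standard-library client is enough to query these endpoints from a script. The sketch below assumes the proxy is running on the default port and that responses are JSON; the response schemas are not documented here, so the code returns the decoded body as-is. The helper names `endpoint_url` and `get_json` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8083/api/v1"


def endpoint_url(*parts: str, base_url: str = BASE_URL) -> str:
    """Join path segments onto the API base, escaping each segment."""
    return "/".join([base_url.rstrip("/")] + [quote(p, safe="") for p in parts])


def get_json(*parts: str, timeout: float = 5.0):
    """GET an endpoint and decode the body, assuming it is JSON."""
    with urlopen(endpoint_url(*parts), timeout=timeout) as response:
        return json.load(response)
```

For example, `get_json("quota")` requests `/api/v1/quota`, and `get_json("history", session_id, "compactions")` requests the compaction events for one session.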

CLI Status

CLI	Status
Claude Code	Fully supported
OpenAI Codex	Fully supported
Gemini CLI	Display supported; oauth-personal auth has a known server-side 403 bug (#25425)

Requirements

Python >= 3.9
MCP tools require Python >= 3.10

License

MIT

Ecosystem

Part of the QuartzUnit open-source ecosystem.

Release history

0.9.6: May 21, 2026
0.9.5: May 21, 2026
0.9.4: May 21, 2026
0.9.1: Apr 29, 2026
0.9.0: Apr 29, 2026
0.8.6 (yanked): Apr 28, 2026