llm-relay

Unified LLM usage management — API proxy, session diagnostics, multi-CLI orchestration.

Why

This project started from a need to escape deep vendor lock-in with a single AI coding tool. After investigating hidden behaviors in Claude Code — silent token inflation, false rate limits, context stripping, and opaque feature flags — it became clear that relying on one vendor's black box was a risk. llm-relay was built to take back visibility and control: monitor what's actually happening, diagnose problems independently, and orchestrate across multiple CLI tools (Claude Code, Codex, Gemini) so no single provider becomes a single point of failure.

Features

Proxy: Transparent API proxy with cache/token monitoring and 12-strategy pruning
Detect: 7 detectors (orphan, stuck, synthetic, bloat, cache, resume, microcompact)
Recover: Session recovery and doctor (7 health checks)
Guard: 4-tier threshold daemon with dual-zone classification
Cost: Per-1% cost calculation and rate-limit header analysis
Orch: Multi-CLI orchestration (Claude Code, Codex CLI, Gemini CLI)
Display: Multi-CLI session monitor with context composition pie chart, connection type badges (SSH/tmux/tailscale/mosh), and provider liveness detection
History: Proxy-level conversation capture with delta/full storage, compaction detection, and web replay viewer
Composition: Real-time context window analysis — classifies content into 6 categories (user/assistant/tool_use/tool_result/thinking/system) with SNR metrics and duplicate read tracking
Monitoring: Quota utilization (Q5h/Q7d), cache hit rate, error rate (2xx/4xx/5xx/429), TTL tier detection (1h/5m) — all surfaced from data already collected by the proxy
TUI: llm-relay top — btop-style terminal monitor with Rich Live (works over SSH, no browser needed)
i18n: Browser locale detection with en/ko support; server-side override via LLM_RELAY_LANG
MCP: 8 tools via stdio transport (cli_delegate, cli_status, cli_probe, orch_delegate, orch_history, relay_stats, session_turns, session_history)
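
As an illustration of the error-rate breakdown (2xx/4xx/5xx/429) listed under Monitoring, here is a minimal sketch. It is not taken from llm-relay's source: the function names `bucket` and `error_breakdown` are hypothetical, and counting 429 separately from the other 4xx responses is an assumption based on how the buckets are named above.

```python
from collections import Counter


def bucket(status: int) -> str:
    """Map an HTTP status code to one of the buckets named in the feature list."""
    if status == 429:
        return "429"  # assumption: rate-limit responses are counted apart from other 4xx
    if 200 <= status < 300:
        return "2xx"
    if 400 <= status < 500:
        return "4xx"
    if 500 <= status < 600:
        return "5xx"
    return "other"


def error_breakdown(statuses):
    """Return the share of responses per bucket (empty dict if no responses)."""
    counts = Counter(bucket(s) for s in statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    print(error_breakdown([200, 200, 429, 500]))
```

Keeping 429 in its own bucket makes rate limiting visible even when ordinary client errors are rare.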

Install

# CLI only (diagnostics, recovery, orchestration)
pip install llm-relay

# With Rich TUI (llm-relay top)
pip install "llm-relay[cli]"

# With proxy + web dashboard
pip install "llm-relay[proxy]"

# With MCP server (Python 3.10+)
pip install "llm-relay[mcp]"

# Everything
pip install "llm-relay[all]"

Quick Start

One-command setup (recommended)

pip install "llm-relay[all]"
llm-relay init

This single command:

Detects installed CLIs (Claude Code, Codex, Gemini)
Initializes the database (~/.llm-relay/usage.db)
Configures Claude Code to route through the proxy (ANTHROPIC_BASE_URL)
Registers the MCP server in Claude Code (8 tools)
Starts the proxy server with history enabled
Runs a health check to verify everything works

After init, open: http://localhost:8083/dashboard/

Options: --dry-run (preview without changes), --skip-server (configure only), --port 9090 (custom port).
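
The CLI detection step can be approximated by looking for each tool's executable on PATH. This is a sketch under assumptions, not the logic that `llm-relay init` actually runs: the executable names `claude`, `codex`, and `gemini` and the helper `detect_clis` are illustrative.

```python
import shutil

# Assumed executable names; adjust if your installs use different ones.
CLI_BINARIES = {
    "Claude Code": "claude",
    "Codex": "codex",
    "Gemini": "gemini",
}


def detect_clis(binaries=CLI_BINARIES, which=shutil.which):
    """Return {display name: resolved path or None} for each known CLI."""
    return {name: which(exe) for name, exe in binaries.items()}


if __name__ == "__main__":
    for name, path in detect_clis().items():
        print(f"{name}: {path or 'not found'}")
```

The `which` parameter exists only so the lookup can be swapped out in tests.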

Manual setup

# CLI diagnostics only (no server needed)
pip install llm-relay
llm-relay scan              # Session health check (7 detectors)
llm-relay doctor            # Configuration health check (7 checks)
llm-relay top               # Live terminal monitor (btop-style TUI)

# Web dashboard
pip install "llm-relay[proxy]"
llm-relay serve             # Starts proxy + dashboard on port 8083

# Then configure Claude Code to use the proxy:
# In ~/.claude/settings.json, add:
#   "env": { "ANTHROPIC_BASE_URL": "http://localhost:8083" }
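
If you prefer to script the settings change instead of editing the file by hand, the following sketch merges the `ANTHROPIC_BASE_URL` entry into an existing settings file while preserving other keys. The helper name `set_proxy_base_url` is hypothetical, and the sketch assumes the file, if present, contains a valid JSON object.

```python
import json
from pathlib import Path


def set_proxy_base_url(settings_path, base_url="http://localhost:8083"):
    """Add or update env.ANTHROPIC_BASE_URL in a JSON settings file."""
    path = Path(settings_path).expanduser()
    # Start from the existing settings so unrelated keys are preserved.
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings.setdefault("env", {})["ANTHROPIC_BASE_URL"] = base_url
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


if __name__ == "__main__":
    print(set_proxy_base_url("~/.claude/settings.json"))
```

Back up the settings file first if it contains configuration you care about.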

Web pages:

/dashboard/ — CLI status, cost, quota, error rate, cache hit rate, Turn Monitor
/display/ — Turn counter with context composition, connection type badges
/history/ — Session conversation replay with compaction timeline

MCP server

llm-relay-mcp               # stdio transport, 8 tools
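
For readers unfamiliar with the stdio transport, MCP clients exchange newline-delimited JSON-RPC 2.0 messages with the server process. The sketch below only builds a `tools/call` request line for one of the tools listed above. It assumes `cli_status` accepts an empty argument object, which this README does not confirm, and a real client must complete the MCP `initialize` handshake before sending it. The helper name `tools_call_message` is hypothetical.

```python
import json


def tools_call_message(tool, arguments=None, request_id=1):
    """Build one newline-terminated JSON-RPC 2.0 request for an MCP tool call."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments or {}},
    }
    return json.dumps(message) + "\n"


if __name__ == "__main__":
    # Assumption: cli_status takes no arguments.
    print(tools_call_message("cli_status"), end="")
```

In practice an MCP-aware client such as Claude Code handles this framing for you.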

API Endpoints

All endpoints are served by the proxy at http://localhost:8083/api/v1/.

Endpoint	Description
`GET /api/v1/turns`	Turn counts + token metrics + zone classification for active sessions
`GET /api/v1/turns/{session_id}`	Per-session metrics with cache hit rate and TTL tier
`GET /api/v1/display`	Session cards with prompts, terminal info, composition
`GET /api/v1/quota`	Anthropic Q5h/Q7d quota utilization and overage status
`GET /api/v1/errors`	Error rate breakdown (2xx/4xx/5xx/429)
`GET /api/v1/cache`	Cache hit rate (global or per-session)
`GET /api/v1/ttl`	Cache TTL tier detection (1h/5m/mixed)
`GET /api/v1/health`	CLI + proxy + orchestration DB health
`GET /api/v1/cost`	Cost breakdown by model
`GET /api/v1/sessions`	Proxy session summaries
`GET /api/v1/cli/status`	CLI installation and auth status
`GET /api/v1/delegations`	Multi-CLI delegation history
`GET /api/v1/delegations/stats`	Delegation aggregate statistics
`GET /api/v1/history`	Sessions with conversation history
`GET /api/v1/history/{session_id}`	Conversation turns for a session
`GET /api/v1/history/{session_id}/compactions`	Compaction events
`GET /api/v1/history/{session_id}/composition`	Per-turn context composition
`GET /api/v1/i18n`	Locale-specific UI messages
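
The endpoints above can be queried with any HTTP client. The sketch below uses only the Python standard library. It assumes the proxy is running on the default port and that responses are JSON; the response schema is not documented here, so the result is returned as-is. The helper names `endpoint_url` and `get_json` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "http://localhost:8083/api/v1"


def endpoint_url(*parts, base=BASE):
    """Join path segments onto the API base, percent-encoding each segment."""
    return "/".join([base.rstrip("/")] + [quote(str(p), safe="") for p in parts])


def get_json(*parts, base=BASE, timeout=5):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(*parts, base=base), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running proxy, e.g. started with `llm-relay serve`.
    print(get_json("health"))
```

For a per-session endpoint, pass the session ID as an extra segment, for example `get_json("history", session_id, "compactions")`.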

CLI Status

CLI	Status
Claude Code	Fully supported
OpenAI Codex	Fully supported
Gemini CLI	Display supported; oauth-personal auth has a known server-side 403 bug (#25425)

Requirements

Python >= 3.9
MCP tools require Python >= 3.10

License

MIT

Ecosystem

Part of the QuartzUnit open-source ecosystem.

Release history

0.9.6: May 21, 2026
0.9.5: May 21, 2026
0.9.4: May 21, 2026
0.9.1: Apr 29, 2026
0.9.0: Apr 29, 2026
0.8.6 (yanked): Apr 28, 2026