SubLLM

Route standard LLM API calls through subscription-authenticated coding agents instead of API keys.

Use your Claude Pro/Max, ChatGPT Plus/Pro, or Google AI Pro/Ultra subscription as the backend for programmatic LLM calls. SubLLM provides an OpenAI-compatible unified interface that abstracts Claude Code, Codex, and Gemini CLIs behind a standard completion() API.

Why?

You already pay for a subscription — Claude Pro/Max, ChatGPT Plus/Pro, Google AI Pro/Ultra — but the moment you need to run a script, you're forced to pay again for API access to the same models. Subscription plans offer flat-rate access but lock you into chat UIs with no programmatic interface. API keys unlock programmatic access but bill per-token on top of what you're already paying.

SubLLM eliminates the double-pay problem. It routes standard completion() calls through CLI tools that authenticate via your existing subscriptions — turning flat-rate chat plans into programmable LLM backends.

  • Flat-rate pricing — pay your subscription, not per-token. Heavy usage costs the same as light usage.
  • Zero API key management — CLIs authenticate through your subscription. No keys to provision, rotate, or secure.
  • OpenAI-compatible interface — standard completion() API with OpenAI ChatCompletion response format. Swap SubLLM in/out of existing code with minimal changes.
  • Cross-provider — same API surface for Claude, GPT, and Gemini. Switch models by changing a string.
| Approach | Cost (heavy usage) | Overhead | Flexibility |
| --- | --- | --- | --- |
| Direct API (per-token) | $50-500+/mo | ~0s | Full control |
| SubLLM (subscription) | $0-200/mo flat | ~270ms | Good for batch/dev |
| Direct API + LiteLLM | $50-500+/mo | ~0s | Multi-provider |

Best for: development, prototyping, batch jobs, CI/CD, personal automation, cost-sensitive pipelines.

Not for: real-time chat UIs, latency-sensitive production services, multi-tenant SaaS (ToS constraints), workloads requiring tool use / function calling / prompt caching / logprobs / stop sequences (these features execute inside CLI sandboxes and are not exposed).

Quick Start

1. Install

uv add subllm                # Core (includes Claude Agent SDK)
uv add "subllm[server]"      # + OpenAI-compatible proxy server (quotes keep the shell from globbing the brackets)

2. Authenticate Your CLIs

Claude Code (subscription auth):

curl -fsSL https://claude.ai/install.sh | bash
claude login
unset ANTHROPIC_API_KEY          # Force subscription auth

For headless/CI:

claude setup-token
export CLAUDE_CODE_OAUTH_TOKEN="your-token-here"

Codex (subscription auth):

npm install -g @openai/codex
codex login

Gemini CLI (free tier or subscription):

npm install -g @google/gemini-cli
gemini                           # Complete Google login
# Or use API key:
export GEMINI_API_KEY="your-key"

3. Use

Python API:

import asyncio
import subllm

async def main():
    # Non-streaming
    response = await subllm.completion(
        model="claude-code/sonnet-4-5",
        messages=[{"role": "user", "content": "Explain monads in one sentence"}],
    )
    print(response.choices[0].message.content)

    # Streaming
    stream = await subllm.completion(
        model="gemini/gemini-3-flash-preview",
        messages=[{"role": "user", "content": "Write a haiku about Rust"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content, end="")

asyncio.run(main())

CLI:

subllm auth                                        # Check all providers
subllm models                                      # List available models
subllm complete "What is 2+2?" -m claude-code/sonnet-4-5
subllm complete "Write a haiku" -m gemini/gemini-3-flash-preview --stream

OpenAI-compatible proxy:

subllm serve --port 8080

# Then use ANY OpenAI-compatible client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="claude-code/sonnet-4-5",
    messages=[{"role": "user", "content": "hello"}],
)

Available Models

| Model ID | Backend | Auth |
| --- | --- | --- |
| claude-code/opus-4-6 | Claude Opus 4.6 | Claude Max ($200) |
| claude-code/sonnet-4-5 | Claude Sonnet 4.5 | Claude Pro ($20) / Max ($100-200) |
| claude-code/haiku-4-5 | Claude Haiku 4.5 | Claude Pro ($20) / Max ($100-200) |
| codex/gpt-5.2 | GPT-5.2 | ChatGPT Plus ($20) / Pro ($200) |
| codex/gpt-5.2-codex | GPT-5.2-Codex | ChatGPT Plus ($20) / Pro ($200) |
| codex/gpt-4.1 | GPT-4.1 | ChatGPT Plus ($20) / Pro ($200) |
| codex/gpt-5-mini | GPT-5 Mini | ChatGPT Plus ($20) / Pro ($200) |
| gemini/gemini-3-pro-preview | Gemini 3 Pro Preview | API key / AI Pro / AI Ultra |
| gemini/gemini-3-flash-preview | Gemini 3 Flash Preview | API key / AI Pro / AI Ultra |

Architecture

User Code ──→ subllm.completion() ──→ Router
                                       ├── ClaudeCodeProvider
                                       │     └── claude-agent-sdk (persistent client)
                                       ├── CodexProvider
                                       │     └── codex exec (subprocess)
                                       └── GeminiCLIProvider
                                             └── gemini -p (subprocess)

All providers delegate auth entirely to the underlying CLIs. SubLLM never stores or manages tokens directly. Multi-turn conversations use stateless message replay — the full conversation history is sent each turn.
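The replay pattern can be sketched in a few lines. Message dicts follow the OpenAI chat format; the assistant reply here is a stand-in for a real completion response:

```python
# Stateless multi-turn: the client owns the history and resends all of it
# each turn; no session state lives on the provider side.
history = [{"role": "user", "content": "My name is Ada."}]

# Turn 1: append the (stand-in) assistant reply to the history.
history.append({"role": "assistant", "content": "Nice to meet you, Ada."})

# Turn 2: the new question is appended, and the ENTIRE list -- all three
# messages -- becomes the payload of the next completion() call.
history.append({"role": "user", "content": "What is my name?"})

print(len(history))  # 3 messages sent on turn 2, not 1
```

The trade-off is payload growth: each turn resends everything before it, which is why multi-turn latency stays roughly flat rather than dropping on later turns.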

Batch Processing

results = await subllm.batch([
    {"model": "claude-code/sonnet-4-5", "messages": [...]},
    {"model": "gemini/gemini-3-flash-preview", "messages": [...]},
    {"model": "codex/gpt-5.2", "messages": [...]},
], concurrency=5)

Runs completions in parallel with a concurrency semaphore. Each provider's CLI handles its own rate limiting internally.
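A minimal sketch of that concurrency pattern, with the inner call replaced by a placeholder (`run_batch` and `worker` are illustrative names here, not SubLLM API):

```python
import asyncio

async def run_batch(requests, concurrency=5):
    # Cap the number of in-flight completions; excess requests wait
    # on the semaphore until a slot frees up.
    sem = asyncio.Semaphore(concurrency)

    async def worker(req):
        async with sem:
            # Placeholder for: await subllm.completion(**req)
            await asyncio.sleep(0)
            return f"done:{req['model']}"

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(worker(r) for r in requests))

results = asyncio.run(
    run_batch([{"model": f"m{i}"} for i in range(6)], concurrency=3)
)
print(results[0])  # done:m0
```

Because `asyncio.gather` returns results in input order, batch output lines up with the request list even when the underlying CLIs finish out of order.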

Benchmarks

Measured on macOS. Timings include full CLI subprocess overhead (spawn, auth, inference, response parsing). Single run — expect variance across sessions.

Auth Check

| Provider | Method | Latency |
| --- | --- | --- |
| claude-code | `claude auth status` | ~266ms |
| codex | subscription check | ~97ms |
| gemini | OAuth credential file | ~2ms |
| all (parallel) | `asyncio.gather` | ~270ms |

Total auth-check time is bounded by the slowest provider. An earlier sequential implementation, which verified auth with full inference round-trips, took ~30s total.
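The parallel check can be sketched with `asyncio.gather`; probe delays are simulated with `sleep`, and `check_provider` is an illustrative name rather than SubLLM API:

```python
import asyncio
import time

async def check_provider(name, delay):
    # Stand-in for a real auth probe, e.g. spawning `claude auth status`.
    await asyncio.sleep(delay)
    return name, True

async def check_all():
    # All probes start together, so wall time tracks the slowest one.
    return await asyncio.gather(
        check_provider("claude-code", 0.27),
        check_provider("codex", 0.10),
        check_provider("gemini", 0.002),
    )

start = time.monotonic()
statuses = asyncio.run(check_all())
elapsed = time.monotonic() - start
# ~0.27s total (the slowest probe), not ~0.37s (the sum of all three)
print(dict(statuses))
```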

Completion

| Provider | Model | Non-streaming | Streaming |
| --- | --- | --- | --- |
| claude-code | sonnet-4-5 | ~13-17s | ~9-15s |
| codex | gpt-5.2 | ~11-15s | ~11-12s |
| gemini | gemini-3-flash-preview | ~8-9s | ~8-9s |

Multi-turn

| Provider | Model | Turn 1 | Turn 2 |
| --- | --- | --- | --- |
| claude-code | sonnet-4-5 | ~16s | ~14s |
| codex | gpt-5.2 | ~15s | ~12s |
| gemini | gemini-3-flash-preview | ~13s | ~14s |

Full conversation history replayed each turn (stateless). Turn 2 carries Turn 1 context.

Cross-provider Handoff

Message history replayed across different providers within a single conversation:

| Turn | Provider | Latency |
| --- | --- | --- |
| 1 (remember) | claude-code/sonnet-4-5 | ~16s |
| 2 (recall) | codex/gpt-5.2 | ~11s |
| 3 (verify) | gemini/gemini-3-flash-preview | ~14s |
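The handoff itself is just the replay pattern with a different model string per turn. In this sketch the assistant replies are stand-ins for real completion responses:

```python
# One shared history, three backends: only the `model` argument changes
# between turns; the messages list is passed through unchanged.
history = [{"role": "user", "content": "Remember the number 42."}]

schedule = [
    ("claude-code/sonnet-4-5", "Noted: 42."),
    ("codex/gpt-5.2", "You asked me to remember 42."),
    ("gemini/gemini-3-flash-preview", "Confirmed: the number is 42."),
]
for model, stand_in_reply in schedule:
    # A real turn would be:
    #   response = await subllm.completion(model=model, messages=history)
    history.append({"role": "assistant", "content": stand_in_reply})

print(len(history))  # 1 user turn + 3 assistant replies = 4
```

Since no provider holds session state, the second and third backends see the full transcript and can answer about context they never generated.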

Batch (3 parallel completions)

| Scope | Latency |
| --- | --- |
| claude-code | ~17s |
| codex | ~14s |
| gemini | ~9s |
| cross-provider | ~14s |

Parallel execution bounded by the slowest request.

Terms of Service & Disclaimer

SubLLM routes completion calls through CLI tools that authenticate via your subscription. It does not circumvent authentication, store credentials, or proxy third-party access. Users are responsible for compliance with each provider's terms of service.

  • Anthropic — Anthropic prohibits third-party developers from offering claude.ai login for their products. The "user brings their own authenticated CLI" pattern is established by Cline, Zed, and Repo Prompt. Safe for personal and team use. Do not ship as a SaaS product.
  • OpenAI — Codex CLI explicitly supports ChatGPT subscription auth and programmatic exec mode. Officially supported pattern.
  • Google — Gemini CLI supports Google OAuth and API key auth. Standard usage pattern.

Review each provider's current ToS before use. Terms change.

License

MIT

Download files

Source distribution: subllm-0.4.0.tar.gz (654.8 kB)

Built distribution: subllm-0.4.0-py3-none-any.whl (23.8 kB)

File details

subllm-0.4.0.tar.gz

  • Size: 654.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6b6af1eb3b1ab026d39c42e7c6f78c013fae216248266e41c975da52e87be480 |
| MD5 | a5c02dd62d490c50f6a73050fe9f6785 |
| BLAKE2b-256 | b746359c683fc49789c9e7804f98bf6b8b09e5356adb1d3ca5d41efe57cc0269 |

File details

subllm-0.4.0-py3-none-any.whl

  • Size: 23.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | aeb0f279e2e60c76d29297bb749628f1cc1e6fd25463b88f14898d3ec4e28393 |
| MD5 | 08aef1796c3a10b46aaba2fd09403645 |
| BLAKE2b-256 | 52dc597c63c303fd127f910c423c06036afb4f8fa182b0aae9e68e0f9f5e8380 |
