Relay — production-grade multi-provider LLM client. One YAML, one interface, every model.

Relay

The fastest, lightest open-source BYOK (bring-your-own-key) relay for every LLM.

License: Apache-2.0 · Python 3.10+

A Python library that gives you one interface to every major LLM — chat, streaming, tool calls, structured output, batch, MCP — defined in a YAML file you check into your repo. Production-grade, enterprise-ready, OSS.

~5–19× faster cold start than LiteLLM, ~20% faster streaming TTFT, and tied at the median on chat overhead with more consistent tails (reproducible benchmarks).

pip install ai5labs-relay

import asyncio

from relay import Hub


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        resp = await hub.chat(
            "fast-cheap",
            messages=[{"role": "user", "content": "What is 2+2?"}],
        )
        print(resp.text)
        print(resp.cost_usd, resp.cost.source)  # cost in USD and where the price came from


asyncio.run(main())

Why Relay

LiteLLM LangChain Relay
YAML model catalog
Built-in pricing snapshot with provenance partial
Live pricing (Bedrock, Azure, OpenRouter)
Tool-call streaming deltas keyed by index (not id) bug (#20711) n/a
MCP universal tool layer (any MCP server → any provider)
Cross-provider tool-schema compiler with Mastra-style fallback
Pydantic structured output (compiles per-provider, not text-coerced) partial
Hub-level cache + Anthropic prompt-cache passthrough partial
Circuit breakers with cooldown + half-open probes
OpenTelemetry GenAI semantic conventions (opt-in)
Reasoning budget unification across OpenAI/Anthropic/Gemini
OpenAI Responses API opt-in (alongside Chat Completions)
Batch API wrapper (OpenAI Batch + Anthropic Message Batches, ~50% off)
Native Bedrock / Azure / Gemini / Vertex / Cohere adapters OpenAI-compat shims partial native
PII redaction pipeline (regex + Presidio hooks)
Audit logging (OTel-aligned schema, pluggable sinks) enterprise SKU
Pre/post guardrails (max-input, blocked-keywords, plugin-able) enterprise SKU
Anthropic thinking blocks preserved flattened flattened
Typed errors (rate-limit / context-window / content-policy distinct) partial
mypy --strict (3 codes opted-out, see pyproject.toml)
Apache-2.0 with explicit patent grant MIT MIT

Quickstart

1. Define your models

Create models.yaml:

# yaml-language-server: $schema=./relay.schema.json
# (generate the schema file once with: `relay schema --out relay.schema.json`)
version: 1

models:
  fast-cheap:
    target: groq/llama-3.3-70b-versatile
    credential: $env.GROQ_API_KEY

  smart:
    target: anthropic/claude-sonnet-4-5
    credential: $env.ANTHROPIC_API_KEY
    params:
      max_tokens: 4096

  cheap-vision:
    target: openai/gpt-4o-mini
    credential: $env.OPENAI_API_KEY

groups:
  default:
    strategy: fallback
    members: [smart, fast-cheap]    # try smart first, fall back to fast-cheap

Line 1 of models.yaml points your editor at the generated schema file. With it in place, the Red Hat YAML extension for VS Code provides autocomplete and inline validation while you edit.
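The `fallback` strategy in the `default` group above can be pictured as trying each member in order until one succeeds. The following is a minimal sketch of that idea, not Relay's implementation: `call_member`, `ProviderError`, and the simulated outage are hypothetical stand-ins so the example runs without any provider credentials.

```python
import asyncio


class ProviderError(Exception):
    """Hypothetical stand-in for any provider-side failure."""


async def call_member(alias: str, prompt: str) -> str:
    # Hypothetical provider call: "smart" is simulated as unavailable.
    if alias == "smart":
        raise ProviderError(f"{alias} is unavailable")
    return f"[{alias}] reply to: {prompt}"


async def chat_with_fallback(members: list[str], prompt: str) -> str:
    """Try each member in order and return the first successful reply."""
    errors: list[Exception] = []
    for alias in members:
        try:
            return await call_member(alias, prompt)
        except ProviderError as exc:
            errors.append(exc)  # remember why this member failed, then move on
    raise ProviderError(f"all members failed: {errors}")


def demo() -> str:
    # Same member order as the `default` group in models.yaml.
    return asyncio.run(chat_with_fallback(["smart", "fast-cheap"], "Hello"))
```

In this sketch every failure triggers a fallback; a real client would likely distinguish errors worth retrying from errors that should fail immediately.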

2. Use it

import asyncio

from relay import Hub

prompts = ["Hello", "What is 2+2?"]  # example inputs for the hot loop below


async def main() -> None:
    messages = [{"role": "user", "content": "Hello"}]

    async with Hub.from_yaml("models.yaml") as hub:
        # Single model
        resp = await hub.chat("fast-cheap", messages=messages)

        # Group with fallback
        resp = await hub.chat("default", messages=messages)

        # Streaming
        async for ev in hub.stream("smart", messages=messages):
            if ev.type == "text_delta":
                print(ev.text, end="", flush=True)
            elif ev.type == "thinking_delta":     # Anthropic extended thinking
                ...
            elif ev.type == "end":
                print(f"\nDone in {ev.response.latency_ms:.0f}ms, "
                      f"${ev.response.cost_usd:.4f}")

        # Bound handle for hot loops
        model = hub.get("fast-cheap")
        for prompt in prompts:
            resp = await model.chat(messages=[{"role": "user", "content": prompt}])


asyncio.run(main())

3. CLI

relay schema --out relay.schema.json     # JSON Schema for editors / docs
relay validate models.yaml               # validate config
relay models list                        # list configured aliases
relay models inspect smart               # show one alias's full config + catalog row
relay models compare sonnet 4o flash     # side-by-side: price, speed, MMLU, GPQA, HumanEval...
relay models recommend --task code --budget cheap --needs tools  # which model for the job?
relay catalog list --provider anthropic  # browse the built-in catalog
relay providers                          # list all supported providers

Supported providers

OpenAI-compatible (one adapter): OpenAI, Groq, Together, DeepSeek, xAI, Mistral, Fireworks, Perplexity, OpenRouter, Ollama, vLLM, LM Studio.

Native (proper, lossless adapters): Anthropic, Azure OpenAI, AWS Bedrock, Cohere, Google Gemini direct, Vertex AI.

Routing

relay.routing is the public extension point for picking a model per call. Two implementations ship with v0.2:

  • RuleBasedRouter — deterministic, constraint-driven, in-process. Same scoring logic as relay models recommend, free.
  • SemanticRouter — HTTP client for the hosted semantic router (paid, optional). Wire protocol documented in docs/routing/api-spec.md.

Attach a router and call chat_routed instead of chat — Relay picks the alias, falls back through alternates on error, and stamps the decision onto response.metadata["routing"]. Custom routers satisfying the Router Protocol are accepted. See docs/routing/usage.md for examples.
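Because routers are accepted by Protocol, any class with the right method shape qualifies without inheriting from a base class. The sketch below illustrates that structural-typing idea only. The `Router` Protocol, the `route` method, its parameters, and `RouteDecision` are all hypothetical names defined here for illustration; the real signatures are in docs/routing/usage.md.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class RouteDecision:
    # Hypothetical result shape: the chosen alias plus ordered alternates.
    alias: str
    alternates: list[str]
    reason: str


@runtime_checkable
class Router(Protocol):
    # Hypothetical method name and signature, not Relay's actual Protocol.
    def route(self, messages: list[dict], candidates: list[str]) -> RouteDecision: ...


class LengthRouter:
    """Send long prompts to one alias and short prompts to another."""

    def __init__(self, short_alias: str, long_alias: str, threshold: int = 200) -> None:
        self.short_alias = short_alias
        self.long_alias = long_alias
        self.threshold = threshold

    def route(self, messages: list[dict], candidates: list[str]) -> RouteDecision:
        total_chars = sum(len(m.get("content", "")) for m in messages)
        pick = self.long_alias if total_chars > self.threshold else self.short_alias
        # Everything not picked becomes a fallback alternate, in the given order.
        alternates = [c for c in candidates if c != pick]
        return RouteDecision(pick, alternates, f"{total_chars} chars")


router = LengthRouter(short_alias="fast-cheap", long_alias="smart")
decision = router.route([{"role": "user", "content": "Hello"}], ["smart", "fast-cheap"])
```

`LengthRouter` never mentions `Router`, yet `isinstance(router, Router)` holds because the method is present; that is the property a Protocol-based extension point relies on.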

Pricing & cost tracking

Every response carries a Cost object with full provenance:

resp.cost.total_usd        # 0.00234
resp.cost.source           # "live_api" | "snapshot" | "user_override" | "estimated" | "unknown"
resp.cost.confidence       # "exact" | "list_price" | "estimated" | "unknown"
resp.cost.fetched_at       # ISO 8601 timestamp (when fetched live)

Tier order (first match wins):

  1. User override — explicit cost: block on a model entry, or a pricing_profile.
  2. Live APIs (cached 6h in-process):
    • AWS Pricing API for Bedrock
    • Azure Retail Prices API for Azure OpenAI
    • OpenRouter /api/v1/models for ~400 models from OpenAI, Anthropic, Google, Groq, etc. at list price
  3. Snapshot — JSON shipped with each release, regenerated weekly via CI.
  4. Unknown: cost_usd = None, never wrong-by-default.
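The tier order above amounts to a first-match-wins lookup chain with a time-limited cache in front of the live tier. Here is a minimal sketch of that pattern, assuming each tier is a plain function from model name to price. `CachedLookup`, `resolve_price`, and every price in the example are illustrative inventions, not Relay's internals or real list prices.

```python
import time
from typing import Callable, Optional

# Each tier returns USD per 1M input tokens, or None if it has no answer.
PriceLookup = Callable[[str], Optional[float]]

CACHE_TTL_SECONDS = 6 * 60 * 60  # the 6h in-process cache mentioned above


class CachedLookup:
    """Wrap a lookup so results are reused until the TTL expires."""

    def __init__(self, lookup: PriceLookup, ttl: float = CACHE_TTL_SECONDS) -> None:
        self._lookup = lookup
        self._ttl = ttl
        self._cache: dict[str, tuple[float, Optional[float]]] = {}
        self.calls = 0  # how many times the wrapped lookup actually ran

    def __call__(self, model: str) -> Optional[float]:
        now = time.monotonic()
        hit = self._cache.get(model)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        self.calls += 1
        price = self._lookup(model)
        self._cache[model] = (now, price)
        return price


def resolve_price(
    model: str, tiers: list[tuple[str, PriceLookup]]
) -> tuple[Optional[float], str]:
    """Return (price, source) from the first tier that knows the model."""
    for source, lookup in tiers:
        price = lookup(model)
        if price is not None:
            return price, source
    return None, "unknown"  # no tier matched, so report no price instead of guessing


# Illustrative data only; none of these numbers come from a real price list.
overrides = {"openai/gpt-4o": 1.25}
live = CachedLookup(lambda m: {"bedrock/model-a": 3.0}.get(m))
snapshot = {"groq/model-b": 0.5}

tiers = [
    ("user_override", overrides.get),
    ("live_api", live),
    ("snapshot", snapshot.get),
]
```

Returning the source label alongside the price is what lets a caller populate a provenance field such as `resp.cost.source`.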

Negotiated rates

No public API exposes enterprise discounts (AWS EDP, Azure committed-use, OpenAI custom tiers). Configure them yourself:

pricing_profiles:
  acme-aws-prod:
    description: "15% EDP discount"
    input_multiplier: 0.85
    output_multiplier: 0.85

  openai-team-tier:
    fixed_overrides:
      openai/gpt-4o:
        input_per_1m: 1.25
        output_per_1m: 5.00

models:
  bedrock-sonnet:
    target: bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0
    region: us-east-1
    credential: { type: aws_profile, profile: prod }
    pricing_profile: acme-aws-prod
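To make the arithmetic behind these profiles concrete, here is a small sketch of how multipliers and fixed overrides could be applied to a per-1M-token price. It assumes a fixed override replaces the list price outright and that multipliers are not applied on top of it; that precedence is an assumption of this sketch. The `PricingProfile` class, `cost_usd` function, and the list price of 3.00 / 15.00 are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class PricingProfile:
    input_multiplier: float = 1.0
    output_multiplier: float = 1.0
    # target -> (input_per_1m, output_per_1m); replaces the list price outright
    fixed_overrides: dict[str, tuple[float, float]] = field(default_factory=dict)


def cost_usd(
    target: str,
    input_tokens: int,
    output_tokens: int,
    list_price: tuple[float, float],
    profile: PricingProfile,
) -> float:
    """Cost of one call in USD; list_price is (input_per_1m, output_per_1m)."""
    if target in profile.fixed_overrides:
        in_rate, out_rate = profile.fixed_overrides[target]
    else:
        in_rate = list_price[0] * profile.input_multiplier
        out_rate = list_price[1] * profile.output_multiplier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


acme = PricingProfile(input_multiplier=0.85, output_multiplier=0.85)
team = PricingProfile(fixed_overrides={"openai/gpt-4o": (1.25, 5.00)})

# Hypothetical list price of 3.00 / 15.00 per 1M tokens, 1000 in and 500 out:
# (1000 * 2.55 + 500 * 12.75) / 1e6, about 0.008925 USD
discounted = cost_usd("bedrock/some-model", 1000, 500, (3.00, 15.00), acme)

# The fixed override ignores the list price: (1000 * 1.25 + 500 * 5.00) / 1e6,
# about 0.00375 USD
overridden = cost_usd("openai/gpt-4o", 1000, 500, (2.50, 10.00), team)
```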

Production-grade design

  • Connection pooling: one httpx.AsyncClient per (provider, base_url), HTTP/2 enabled, keep-alive tuned for streaming workloads.
  • Lazy SDK imports: boto3 and other heavy dependencies are imported only on first use.
  • Streaming hot path uses orjson and dicts — no Pydantic validation per-token. Pydantic only runs on the final assembled response.
  • Tool-call delta merging keyed by index, not id. (LiteLLM keys by id and drops ~90% of argument deltas — issue #20711.)
  • Provider-specific blocks preserved: Anthropic thinking, Gemini grounding, citations — emitted as typed events, not flattened.
  • Classified errors: RateLimitError, ContextWindowError, ContentPolicyError, AuthenticationError are distinct types — fall back vs retry vs fail-fast can be decided automatically.
  • OpenTelemetry GenAI semantic conventions (opt-in): emits gen_ai.* spans + metrics that Datadog, Honeycomb, Langfuse, and Arize all consume.
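The tool-call merging point above is easiest to see with data. In OpenAI-style streams, only the first fragment of each tool call carries an `id`; later fragments carry just an `index` and more argument text. The sketch below merges fragments by `index`. The function name and the sample stream are illustrative and not taken from Relay's source.

```python
import json


def merge_tool_call_deltas(deltas: list[dict]) -> list[dict]:
    """Assemble complete tool calls from streamed fragments, keyed by index."""
    calls: dict[int, dict] = {}
    for delta in deltas:
        slot = calls.setdefault(
            delta["index"], {"id": None, "name": None, "arguments": ""}
        )
        if delta.get("id"):
            slot["id"] = delta["id"]  # normally present on the first fragment only
        fn = delta.get("function") or {}
        if fn.get("name"):
            slot["name"] = fn["name"]
        slot["arguments"] += fn.get("arguments") or ""  # append argument text in order
    return [calls[i] for i in sorted(calls)]


# Two interleaved tool calls; fragments after the first have no id.
stream = [
    {"index": 0, "id": "call_a", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"city": '}},
    {"index": 1, "id": "call_b", "function": {"name": "get_time", "arguments": '{"tz"'}},
    {"index": 0, "function": {"arguments": '"Paris"}'}},
    {"index": 1, "function": {"arguments": ': "UTC"}'}},
]

merged = merge_tool_call_deltas(stream)
parsed = [json.loads(call["arguments"]) for call in merged]
```

In this sample, three of the five fragments have no `id`, so a merger keyed by `id` would have nowhere to put them and the argument JSON would come out truncated.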

Security

  • Keys never inline in YAML — credentials are reified objects (env var, AWS Secrets Manager, GCP Secret Manager, Vault).
  • Library, not a hosted proxy by default. Your API keys stay in your process. (Compare: the LiteLLM proxy PyPI compromise of March 2026 leaked keys from every centralized deployment.)
  • Releases will be Sigstore-signed via OIDC Trusted Publishing.
  • See SECURITY.md for vulnerability reporting.
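As an illustration of the first point in the list above, a credential reference such as `$env.GROQ_API_KEY` can be resolved at load time so that the secret itself never appears in the YAML. This is a minimal sketch covering only the `$env.` form seen in the Quickstart; `resolve_credential` and `MissingCredentialError` are hypothetical names, and Relay's own resolver (which also handles secret managers) may work differently.

```python
import os
from typing import Mapping, Optional


class MissingCredentialError(KeyError):
    """Hypothetical error raised when a referenced variable is not set."""


def resolve_credential(ref: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a `$env.NAME` reference to the value of environment variable NAME."""
    environ = os.environ if environ is None else environ
    prefix = "$env."
    if not ref.startswith(prefix):
        # Refuse anything that is not a reference, e.g. an inline secret.
        raise ValueError("credentials must be references such as $env.NAME")
    name = ref[len(prefix):]
    try:
        return environ[name]
    except KeyError:
        raise MissingCredentialError(name) from None
```

Passing the environment as a parameter keeps the function easy to test: `resolve_credential("$env.GROQ_API_KEY", {"GROQ_API_KEY": "test"})` returns `"test"` without touching the real process environment.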

Status

v0.2.2 (alpha) — chat, streaming, tool calls, structured output, batch (OpenAI Batch + Anthropic Message Batches), MCP, Hub-level cache + provider-cache passthrough, PII redaction, audit logging, pre/post guardrails, OpenTelemetry GenAI semantic conventions, cost tracking with live pricing, 12 OpenAI-compatible providers + 6 native adapters (Anthropic, Azure OpenAI, AWS Bedrock, Cohere, Google Gemini direct, Vertex AI), plus opt-in OpenAI Responses API.

API surface is stable; everything under _internal/ and _* modules is not.

Development

uv sync --all-groups
uv run pytest
uv run ruff check
uv run mypy
uv run pyright

Contributing

See CONTRIBUTING.md. Please read CODE_OF_CONDUCT.md before opening a PR.

Support

Relay is free, Apache-2.0, and actively maintained by ai5labs Research OPC Pvt Ltd. If your team uses it in production, please consider:

  • Star the repo — actually helps a lot at this stage
  • 🤝 Become a design partner — direct line to maintainers, roadmap influence, free for the program duration
  • 🏢 Enterprise support (planned for v0.3, Q3 2026) — SLAs, custom features, VPC deployment, SOC 2, BAA/DPA on the roadmap. Email engineering@ai5labs.com to be a design partner.

See SUPPORT.md for full details.

License

Apache-2.0. See LICENSE. Copyright © 2026 ai5labs Research OPC Pvt Ltd.
