Skip to main content

Relay — production-grade multi-provider LLM client. One YAML, one interface, every model.

Project description

Relay

The fastest, lightest BYOK relay for any and every LLM model — open source.

CI Apache-2.0 Python 3.10+ Sponsor

A Python library that gives you one interface to every major LLM — chat, streaming, tool calls, structured output, batch, MCP — defined in a YAML file you check into your repo. Production-grade, enterprise-ready, OSS.

~5–19× faster cold start than LiteLLM, ~20% faster streaming TTFT, and tied at the median on chat overhead with more consistent tails (reproducible benchmarks).

pip install ai5labs-relay
from relay import Hub

async with Hub.from_yaml("models.yaml") as hub:
    resp = await hub.chat(
        "fast-cheap",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(resp.text)
    print(resp.cost_usd, resp.cost.source)

Why Relay

LiteLLM LangChain Relay
YAML model catalog
Built-in pricing snapshot with provenance partial
Live pricing (Bedrock, Azure, OpenRouter)
Tool-call streaming deltas keyed by index (not id) bug (#20711) n/a
MCP universal tool layer (any MCP server → any provider)
Cross-provider tool-schema compiler with Mastra-style fallback
Pydantic structured output (compiles per-provider, not text-coerced) partial
Hub-level cache + Anthropic prompt-cache passthrough partial
Circuit breakers with cooldown + half-open probes
OpenTelemetry GenAI semantic conventions (opt-in)
Reasoning budget unification across OpenAI/Anthropic/Gemini
OpenAI Responses API opt-in (alongside Chat Completions)
Batch API wrapper (OpenAI Batch + Anthropic Message Batches, ~50% off)
Native Bedrock / Azure / Gemini / Vertex / Cohere adapters OpenAI-compat shims partial native
PII redaction pipeline (regex + Presidio hooks)
Audit logging (OTel-aligned schema, pluggable sinks) enterprise SKU
Pre/post guardrails (max-input, blocked-keywords, plugin-able) enterprise SKU
Anthropic thinking blocks preserved flattened flattened
Typed errors (rate-limit / context-window / content-policy distinct) partial
OpenTelemetry GenAI semantic conventions ✓ (opt-in)
mypy --strict clean
Apache-2.0 with explicit patent grant MIT MIT

Quickstart

1. Define your models

Create models.yaml:

# yaml-language-server: $schema=https://relay.ai5labs.com/schema/v1.json
version: 1

models:
  fast-cheap:
    target: groq/llama-3.3-70b-versatile
    credential: $env.GROQ_API_KEY

  smart:
    target: anthropic/claude-sonnet-4-5
    credential: $env.ANTHROPIC_API_KEY
    params:
      max_tokens: 4096

  cheap-vision:
    target: openai/gpt-4o-mini
    credential: $env.OPENAI_API_KEY

groups:
  default:
    strategy: fallback
    members: [smart, fast-cheap]    # try smart first, fall back to fast-cheap

Then point your editor at the schema URL on line 1 — the Red Hat YAML extension for VS Code will give you autocomplete and inline validation while editing.

2. Use it

from relay import Hub

async with Hub.from_yaml("models.yaml") as hub:
    # Single model
    resp = await hub.chat("fast-cheap", messages=[
        {"role": "user", "content": "Hello"}
    ])

    # Group with fallback
    resp = await hub.chat("default", messages=[...])

    # Streaming
    async for ev in hub.stream("smart", messages=[...]):
        if ev.type == "text_delta":
            print(ev.text, end="", flush=True)
        elif ev.type == "thinking_delta":     # Anthropic extended thinking
            ...
        elif ev.type == "end":
            print(f"\nDone in {ev.response.latency_ms:.0f}ms, "
                  f"${ev.response.cost_usd:.4f}")

    # Bound handle for hot loops
    model = hub.get("fast-cheap")
    for prompt in prompts:
        resp = await model.chat(messages=[{"role": "user", "content": prompt}])

3. CLI

relay schema --out relay.schema.json     # JSON Schema for editors / docs
relay validate models.yaml               # validate config
relay models list                        # list configured aliases
relay models inspect smart               # show one alias's full config + catalog row
relay models compare sonnet 4o flash     # side-by-side: price, speed, MMLU, GPQA, HumanEval...
relay models recommend --task code --budget cheap --needs tools  # which model for the job?
relay catalog list --provider anthropic  # browse the built-in catalog
relay providers                          # list all supported providers

Supported providers

OpenAI-compatible (one adapter): OpenAI, Groq, Together, DeepSeek, xAI, Mistral, Fireworks, Perplexity, OpenRouter, Ollama, vLLM, LM Studio.

Native (proper, lossless adapters): Anthropic, Azure OpenAI, AWS Bedrock, Cohere, Google Gemini direct, Vertex AI.

Routing

relay.routing is the public extension point for picking a model per call. Two implementations ship with v0.2:

  • RuleBasedRouter — deterministic, constraint-driven, in-process. Same scoring logic as relay models recommend, free.
  • SemanticRouter — HTTP client for the hosted semantic router (paid, optional). Wire protocol documented in docs/routing/api-spec.md.

Attach a router and call chat_routed instead of chat — Relay picks the alias, falls back through alternates on error, and stamps the decision onto response.metadata["routing"]. Custom routers satisfying the Router Protocol are accepted. See docs/routing/usage.md for examples.

Pricing & cost tracking

Every response carries a Cost object with full provenance:

resp.cost.total_usd        # 0.00234
resp.cost.source           # "live_api" | "snapshot" | "user_override" | "estimated" | "unknown"
resp.cost.confidence       # "exact" | "list_price" | "estimated" | "unknown"
resp.cost.fetched_at       # ISO 8601 timestamp (when fetched live)

Tier order (first match wins):

  1. User override — explicit cost: block on a model entry, or a pricing_profile.
  2. Live APIs (cached 6h in-process):
    • AWS Pricing API for Bedrock
    • Azure Retail Prices API for Azure OpenAI
    • OpenRouter /api/v1/models for ~400 models from OpenAI, Anthropic, Google, Groq, etc. at list price
  3. Snapshot — JSON shipped with each release, regenerated weekly via CI.
  4. Unknowncost_usd = None, never wrong-by-default.

Negotiated rates

No public API exposes enterprise discounts (AWS EDP, Azure committed-use, OpenAI custom tiers). Configure them yourself:

pricing_profiles:
  acme-aws-prod:
    description: "15% EDP discount"
    input_multiplier: 0.85
    output_multiplier: 0.85

  openai-team-tier:
    fixed_overrides:
      openai/gpt-4o:
        input_per_1m: 1.25
        output_per_1m: 5.00

models:
  bedrock-sonnet:
    target: bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0
    region: us-east-1
    credential: { type: aws_profile, profile: prod }
    pricing_profile: acme-aws-prod

Production-grade design

  • Connection pooling: one httpx.AsyncClient per (provider, base_url), HTTP/2 enabled, keep-alive tuned for streaming workloads.
  • Lazy SDK imports: boto3 and other heavy deps only load when their first call happens.
  • Streaming hot path uses orjson and dicts — no Pydantic validation per-token. Pydantic only runs on the final assembled response.
  • Tool-call delta merging keyed by index, not id. (LiteLLM keys by id and drops ~90% of argument deltas — issue #20711.)
  • Provider-specific blocks preserved: Anthropic thinking, Gemini grounding, citations — emitted as typed events, not flattened.
  • Classified errors: RateLimitError, ContextWindowError, ContentPolicyError, AuthenticationError are distinct types — fall back vs retry vs fail-fast can be decided automatically.
  • OpenTelemetry GenAI semantic conventions (opt-in): emits gen_ai.* spans + metrics that Datadog, Honeycomb, Langfuse, and Arize all consume.

Security

  • Keys never inline in YAML — credentials are reified objects (env var, AWS Secrets Manager, GCP Secret Manager, Vault).
  • Library, not a hosted proxy by default. Your API keys stay in your process. (Compare: the LiteLLM proxy PyPI compromise of March 2026 leaked keys from every centralized deployment.)
  • Releases will be Sigstore-signed via OIDC Trusted Publishing.
  • See SECURITY.md for vulnerability reporting.

Status

v0.1 (alpha) — chat + streaming + tool calls + cost tracking + the OpenAI-compatible adapter (12 providers) + native adapters for Anthropic, Azure OpenAI, AWS Bedrock, Cohere, Google Gemini direct, and Vertex AI.

API surface is stable; everything under _internal/ and _* modules is not.

Development

uv sync --all-groups
uv run pytest
uv run ruff check
uv run mypy
uv run pyright

Contributing

See CONTRIBUTING.md. Please read CODE_OF_CONDUCT.md before opening a PR.

Support & sponsorship

Relay is free, Apache-2.0, and actively maintained by ai5labs Research OPC Pvt Ltd. If your team uses it in production, please consider:

See SUPPORT.md for full details.

Sponsors

This space is for our sponsors — be the first.

Design partners

This space is for our design partners — currently accepting our first cohort.

License

Apache-2.0. See LICENSE. Copyright © 2026 ai5labs Research OPC Pvt Ltd.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai5labs_relay-0.2.1.tar.gz (114.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai5labs_relay-0.2.1-py3-none-any.whl (143.7 kB view details)

Uploaded Python 3

File details

Details for the file ai5labs_relay-0.2.1.tar.gz.

File metadata

  • Download URL: ai5labs_relay-0.2.1.tar.gz
  • Upload date:
  • Size: 114.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ai5labs_relay-0.2.1.tar.gz
Algorithm Hash digest
SHA256 bda6288f8f7b7479c287337a85e27fe1801bbad5aee92ec80b329b3d765c3137
MD5 33f0db9213c395480a6a91383572d1fb
BLAKE2b-256 c56df73cb958d49ebc12109f07b392cd591b1c34b9c81f87fd1022c924f22f45

See more details on using hashes here.

Provenance

The following attestation bundles were made for ai5labs_relay-0.2.1.tar.gz:

Publisher: release.yml on ai5labs/relay-llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ai5labs_relay-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: ai5labs_relay-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 143.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ai5labs_relay-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5decefd53a4ba5d41bda1fbc46b7d3cd9b18fcf4836670e1f52e15d91f83a502
MD5 6b90d7fb2ad60696c4e9fb86621550c3
BLAKE2b-256 cb66de12894009dec8486350769e1a3fd4c656a4d861e11d344cae342f903e8f

See more details on using hashes here.

Provenance

The following attestation bundles were made for ai5labs_relay-0.2.1-py3-none-any.whl:

Publisher: release.yml on ai5labs/relay-llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page