Batteries-included loadbalancing client for Azure OpenAI

Azure Switchboard

Batteries-included, coordination-free client loadbalancing for Azure OpenAI and OpenAI.

uv add azure-switchboard

Overview

azure-switchboard is a Python 3 asyncio library that provides an API-compatible, client-side load balancer for Chat Completions. You instantiate a Switchboard with one or more Deployments, and requests are distributed across healthy deployments using the "power of two random choices" method. Deployments can point at Azure OpenAI (base_url=.../openai/v1/) or OpenAI (base_url=None).

Features

  • API Compatibility: Switchboard.create is a transparently-typed proxy for OpenAI.chat.completions.create.
  • Coordination-Free: The default Two Random Choices algorithm does not require coordination between client instances to achieve excellent load distribution characteristics.
  • Utilization-Aware: TPM/RPM utilization is tracked per model per deployment for use during selection.
  • Batteries Included:
    • Session Affinity: Provide a session_id to route requests in the same session to the same deployment.
    • Automatic Failover: Retries are controlled by a tenacity AsyncRetrying policy (failover_policy).
    • Pluggable Selection: Custom selection algorithms can be provided by passing a callable to the selector parameter on the Switchboard constructor.
    • OpenTelemetry Integration: Built-in metrics for request routing and healthy deployment counts.
  • Lightweight: Small codebase with minimal dependencies: openai, tenacity, wrapt, and opentelemetry-api.
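
To make the selection idea concrete, here is a minimal sketch of the general "power of two random choices" technique: sample two candidates at random and keep the less-utilized one. It follows the (model, eligible_deployments) -> deployment shape listed in the Configuration Reference, but FakeDeployment, its utilization field, and two_random_choices_sketch are hypothetical names for illustration. This is not the library's implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeDeployment:
    """Hypothetical stand-in; not the library's Deployment class."""

    name: str
    utilization: float  # assumed to lie between 0.0 and 1.0


def two_random_choices_sketch(
    model: str,
    eligible: List[FakeDeployment],
    rng: Optional[random.Random] = None,
) -> FakeDeployment:
    """Sample up to two distinct candidates and return the less-utilized one."""
    if not eligible:
        raise ValueError(f"no eligible deployments for {model}")
    source = rng or random
    # With a single eligible deployment, the sample is just that deployment.
    candidates = source.sample(eligible, k=min(2, len(eligible)))
    return min(candidates, key=lambda d: d.utilization)


if __name__ == "__main__":
    pool = [
        FakeDeployment("east", 0.9),
        FakeDeployment("west", 0.1),
        FakeDeployment("south", 0.5),
    ]
    print(two_random_choices_sketch("gpt-4o-mini", pool).name)
```

Because each client only compares two locally known utilization values, no shared state between clients is needed. A side effect worth noting is that the most-utilized deployment in a pool is never picked while at least two candidates are eligible.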

Runnable Example

#!/usr/bin/env python3
#
# To run this, use:
#   uv run --env-file .env tools/readme_example.py
#
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "azure-switchboard",
# ]
# ///

import asyncio
import os

from azure_switchboard import Deployment, Model, Switchboard

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

deployments = []
if azure_openai_endpoint and azure_openai_api_key:
    # create 3 deployments. reusing the endpoint
    # is fine for the purposes of this demo
    for name in ("east", "west", "south"):
        deployments.append(
            Deployment(
                name=name,
                # strip any trailing slash so the joined URL has no "//"
                base_url=f"{azure_openai_endpoint.rstrip('/')}/openai/v1/",
                api_key=azure_openai_api_key,
                models=[Model(name="gpt-4o-mini")],
            )
        )

if openai_api_key:
    deployments.append(
        Deployment(
            name="openai",
            api_key=openai_api_key,
            models=[Model(name="gpt-4o-mini")],
        )
    )

if not deployments:
    raise RuntimeError(
        "Set AZURE_OPENAI_ENDPOINT/AZURE_OPENAI_API_KEY or OPENAI_API_KEY to run this example."
    )


async def main():
    async with Switchboard(deployments=deployments) as sb:
        print("Basic functionality:")
        await basic_functionality(sb)

        print("Session affinity (should warn):")
        await session_affinity(sb)


async def basic_functionality(switchboard: Switchboard):
    # Make a completion request (non-streaming)
    response = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
    )

    print("completion:", response.choices[0].message.content)

    # Make a streaming completion request
    stream = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
        stream=True,
    )

    print("streaming: ", end="")
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

    print()


async def session_affinity(switchboard: Switchboard):
    session_id = "anything"

    # First message will select a random healthy
    # deployment and associate it with the session_id
    r = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2020?"}],
    )

    d1 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 1:", d1)
    print("response 1:", r.choices[0].message.content)

    # Follow-up requests with the same session_id will route to the same deployment
    r2 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Who won the World Series in 2020?"},
            {"role": "assistant", "content": r.choices[0].message.content},
            {"role": "user", "content": "Who did they beat?"},
        ],
    )

    print("response 2:", r2.choices[0].message.content)

    # Simulate a failure by marking down the deployment
    d1.models["gpt-4o-mini"].mark_down()

    # A new deployment will be selected for this session_id
    r3 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2021?"}],
    )

    d2 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 2:", d2)
    print("response 3:", r3.choices[0].message.content)
    assert d2 != d1


if __name__ == "__main__":
    asyncio.run(main())

Benchmarks

just bench
uv run --env-file .env tools/bench.py -v -r 1000 -d 10 -e 500
Distributing 1000 requests across 10 deployments
Max inflight requests: 1000

Request 500/1000 completed
Utilization Distribution:
0.000 - 0.200 |   0
0.200 - 0.400 |  10 ..............................
0.400 - 0.600 |   0
0.600 - 0.800 |   0
0.800 - 1.000 |   0
Avg utilization: 0.339 (0.332 - 0.349)
Std deviation: 0.006

{
    'bench_0': {'gpt-4o-mini': {'util': 0.361, 'tpm': '10556/30000', 'rpm': '100/300'}},
    'bench_1': {'gpt-4o-mini': {'util': 0.339, 'tpm': '9819/30000', 'rpm': '100/300'}},
    'bench_2': {'gpt-4o-mini': {'util': 0.333, 'tpm': '9405/30000', 'rpm': '97/300'}},
    'bench_3': {'gpt-4o-mini': {'util': 0.349, 'tpm': '10188/30000', 'rpm': '100/300'}},
    'bench_4': {'gpt-4o-mini': {'util': 0.346, 'tpm': '10210/30000', 'rpm': '99/300'}},
    'bench_5': {'gpt-4o-mini': {'util': 0.341, 'tpm': '10024/30000', 'rpm': '99/300'}},
    'bench_6': {'gpt-4o-mini': {'util': 0.343, 'tpm': '10194/30000', 'rpm': '100/300'}},
    'bench_7': {'gpt-4o-mini': {'util': 0.352, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_8': {'gpt-4o-mini': {'util': 0.35, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_9': {'gpt-4o-mini': {'util': 0.365, 'tpm': '10840/30000', 'rpm': '101/300'}}
}

Utilization Distribution:
0.000 - 0.100 |   0
0.100 - 0.200 |   0
0.200 - 0.300 |   0
0.300 - 0.400 |  10 ..............................
0.400 - 0.500 |   0
0.500 - 0.600 |   0
0.600 - 0.700 |   0
0.700 - 0.800 |   0
0.800 - 0.900 |   0
0.900 - 1.000 |   0
Avg utilization: 0.348 (0.333 - 0.365)
Std deviation: 0.009

Distribution overhead: 926.14ms
Average response latency: 5593.77ms
Total latency: 17565.37ms
Requests per second: 1079.75
Overhead per request: 0.93ms

Distribution overhead scales ~linearly with the number of deployments.
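
The summary figures above appear to be related by simple arithmetic. The sketch below reproduces two of them from the reported distribution overhead and request count. The relationship is inferred from the printed numbers, not taken from tools/bench.py, and the variable names are illustrative.

```python
# Values copied from the benchmark output above.
total_requests = 1000
distribution_overhead_ms = 926.14

# Overhead per request: total distribution overhead spread over all requests.
overhead_per_request_ms = distribution_overhead_ms / total_requests

# Requests per second: how many selections fit into one second of overhead.
requests_per_second = total_requests / (distribution_overhead_ms / 1000)

print(f"Overhead per request: {overhead_per_request_ms:.2f}ms")
print(f"Requests per second: {requests_per_second:.2f}")
```

Read this way, "Requests per second" reflects selection throughput rather than end-to-end completion throughput, which is bounded by the model's response latency.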

Configuration Reference

switchboard.Model Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| name | Model name as sent to Chat Completions | Required |
| tpm | Tokens-per-minute budget used for utilization tracking and routing | 0 (unlimited) |
| rpm | Requests-per-minute budget used for utilization tracking and routing | 0 (unlimited) |
| default_cooldown | Cooldown duration (seconds) after a deployment/model failure mark-down | 10.0 |
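
As a rough mental model of how these parameters could interact, the sketch below tracks usage against the tpm and rpm budgets and applies a cooldown after a mark-down. ModelStateSketch and all of its methods are hypothetical. In particular, computing utilization as the larger of the two budget ratios is an assumption, not the library's documented formula.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelStateSketch:
    """Hypothetical per-model state; parameter names mirror the table above."""

    name: str
    tpm: int = 0  # 0 means unlimited
    rpm: int = 0  # 0 means unlimited
    default_cooldown: float = 10.0
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    tokens_used: int = 0
    requests_used: int = 0
    down_until: float = 0.0

    def record(self, tokens: int) -> None:
        """Count one request and its token usage."""
        self.tokens_used += tokens
        self.requests_used += 1

    def utilization(self) -> float:
        # Assumption: take the larger budget ratio; unlimited budgets count as 0.
        tpm_ratio = self.tokens_used / self.tpm if self.tpm else 0.0
        rpm_ratio = self.requests_used / self.rpm if self.rpm else 0.0
        return max(tpm_ratio, rpm_ratio)

    def mark_down(self) -> None:
        """Treat the model as unhealthy for default_cooldown seconds."""
        self.down_until = self.clock() + self.default_cooldown

    def is_healthy(self) -> bool:
        return self.clock() >= self.down_until

    def reset_usage(self) -> None:
        """Clear counters, as a periodic rate-limit window reset might."""
        self.tokens_used = 0
        self.requests_used = 0
```

Injecting the clock keeps the cooldown logic testable without sleeping. A real implementation would also need to decide how to account for streaming responses, which this sketch ignores.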

switchboard.Deployment Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| name | Unique identifier for the deployment | Required |
| base_url | API base URL. Azure example: https://<resource>.openai.azure.com/openai/v1/. OpenAI: leave None. | None |
| api_key | API key for the deployment | None |
| timeout | Default request timeout (seconds) | 600.0 |
| models | Models available on this deployment | Built-in model name defaults |
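
Azure endpoints are often configured with a trailing slash, which produces a double slash when the path is appended naively. A small helper along these lines could normalize the value before it is passed as base_url. The function name azure_base_url is hypothetical and not part of the library.

```python
def azure_base_url(endpoint: str) -> str:
    """Build an .../openai/v1/ base URL from an Azure resource endpoint.

    Hypothetical helper: strips trailing slashes so the result never
    contains "//" between the host and the path.
    """
    return endpoint.rstrip("/") + "/openai/v1/"


if __name__ == "__main__":
    print(azure_base_url("https://example.openai.azure.com/"))
```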

switchboard.Switchboard Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| deployments | List of deployment configs | Required |
| selector | Deployment selection function (model, eligible_deployments) -> deployment | two_random_choices |
| failover_policy | Tenacity AsyncRetrying policy used around each create call | AsyncRetrying(stop=stop_after_attempt(2), retry=retry_if_not_exception_type(SwitchboardError), reraise=True) |
| ratelimit_window | How often usage counters reset (seconds). Set 0 to disable periodic reset. | 60.0 |
| max_sessions | LRU capacity for session affinity map | 1024 |
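
To illustrate what an LRU-bounded session affinity map means in practice, here is a minimal sketch built on collections.OrderedDict. SessionMapSketch and its methods are hypothetical, and the library's internal data structure may differ.

```python
from collections import OrderedDict
from typing import Optional


class SessionMapSketch:
    """Hypothetical LRU map from session_id to deployment name."""

    def __init__(self, max_sessions: int = 1024) -> None:
        self.max_sessions = max_sessions
        self._sessions = OrderedDict()

    def get(self, session_id: str) -> Optional[str]:
        """Return the pinned deployment, refreshing its recency."""
        name = self._sessions.get(session_id)
        if name is not None:
            self._sessions.move_to_end(session_id)
        return name

    def put(self, session_id: str, deployment_name: str) -> None:
        """Pin a session, evicting the least recently used entries if full."""
        self._sessions[session_id] = deployment_name
        self._sessions.move_to_end(session_id)
        while len(self._sessions) > self.max_sessions:
            self._sessions.popitem(last=False)
```

With a bound like this, a session that has been idle long enough to be evicted simply gets a fresh deployment on its next request, so memory stays constant regardless of how many session IDs a client sees.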

Development

This project uses uv for package management and the just command runner for task automation. See the justfile for available commands.

git clone https://github.com/arini-ai/azure-switchboard
cd azure-switchboard

just install

Running tests

just test

Release

This library uses CalVer for versioning. On push to master, if tests pass, a package is automatically built, released, and uploaded to PyPI.

Locally, the package can be built with uv:

uv build

OpenTelemetry Integration

azure-switchboard uses OpenTelemetry metrics via the meter azure_switchboard.switchboard.

Metrics emitted on the request path include:

  • healthy_deployments_count (gauge)
  • requests (counter, with deployment + model attributes)

To run with local OTEL instrumentation:

just otel-run

Contributing

  1. Fork/clone repo
  2. Make changes
  3. Run tests with just test
  4. Lint with just lint
  5. Commit and make a PR

License

MIT
