Azure Switchboard

Batteries-included, coordination-free client loadbalancing for Azure OpenAI.

uv add azure-switchboard

Overview

azure-switchboard is a Python 3 asyncio library that provides an intelligent, API-compatible client loadbalancer for Azure OpenAI. You instantiate a Switchboard client with a set of deployments, and the client distributes your chat completion requests across the available deployments using the "power of two random choices" method: for each request it samples two candidate deployments and sends the request to the less utilized one. In this sense, it functions as a lightweight service mesh between your application and Azure OpenAI. The basic idea is inspired by ServiceRouter.
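To make the selection idea concrete, here is a minimal sketch of two random choices. It is not the library's code: FakeDeployment and its utilization field are hypothetical stand-ins, and the real selector may weigh deployments differently.

```python
import random
from dataclasses import dataclass


@dataclass
class FakeDeployment:
    """Hypothetical stand-in for a deployment; not the library's class."""

    name: str
    utilization: float  # 0.0 (idle) to 1.0 (at its rate limit)


def two_random_choices(deployments, rng=random):
    """Sample two distinct candidates and keep the less utilized one."""
    if not deployments:
        raise ValueError("no deployments available")
    if len(deployments) == 1:
        return deployments[0]
    a, b = rng.sample(deployments, 2)
    return a if a.utilization <= b.utilization else b


if __name__ == "__main__":
    pool = [
        FakeDeployment("east", 0.9),
        FakeDeployment("west", 0.2),
        FakeDeployment("south", 0.5),
    ]
    picks = [two_random_choices(pool).name for _ in range(10_000)]
    for d in pool:
        print(d.name, picks.count(d.name))
```

Because each client only compares two locally tracked utilization values, no shared state between clients is needed. With the static utilizations above, the busiest deployment ("east") loses every comparison and is never picked; in practice utilization changes with every request, so load evens out.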

Features

  • API Compatibility: Switchboard.create is a transparently-typed drop-in proxy for OpenAI.chat.completions.create.

  • Coordination-Free: The default Two Random Choices algorithm does not require coordination between client instances to achieve excellent load distribution characteristics.

  • Utilization-Aware: TPM/RPM ratelimit utilization is tracked per model per deployment for use during selection.

  • Batteries Included:

    • Session Affinity: Provide a session_id to route requests in the same session to the same deployment, optimizing for prompt caching.
    • Automatic Failover: Client automatically retries on request failure, with optional fallback to OpenAI by providing an OpenAIDeployment in deployments. The retry policy can also be customized by passing a tenacity AsyncRetrying instance to failover_policy.
    • Pluggable Selection: Custom selection algorithms can be provided by passing a callable to the selector parameter on the Switchboard constructor.
    • OpenTelemetry Integration: Comprehensive metrics and instrumentation for monitoring deployment health and utilization.
  • Lightweight: sub-400 LOC implementation with minimal dependencies: openai, tenacity, wrapt, and opentelemetry-api. <1ms overhead per request.

  • 100% Test Coverage: Comprehensive test suite with pytest.
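As an illustration of how session affinity and failover can interact, the sketch below pins each session_id to a target in a bounded least-recently-used map and re-selects when the pinned target is on cooldown. Target, SessionRouter, and the random fallback choice are hypothetical names invented for this example; the library's internals may differ.

```python
import random
import time
from collections import OrderedDict


class Target:
    """Hypothetical deployment stand-in that can be put on cooldown."""

    def __init__(self, name):
        self.name = name
        self.cooldown_until = 0.0

    def cooldown(self, seconds=10.0):
        # Mark the target unhealthy for the given number of seconds.
        self.cooldown_until = time.monotonic() + seconds

    def is_healthy(self):
        return time.monotonic() >= self.cooldown_until


class SessionRouter:
    """Pins a session_id to a target; evicts the least recently used session."""

    def __init__(self, targets, max_sessions=1024):
        self.targets = list(targets)
        self.max_sessions = max_sessions
        self.sessions = OrderedDict()  # session_id -> Target

    def select(self, session_id=None):
        if session_id is not None:
            pinned = self.sessions.get(session_id)
            if pinned is not None and pinned.is_healthy():
                self.sessions.move_to_end(session_id)  # mark as recently used
                return pinned
        healthy = [t for t in self.targets if t.is_healthy()]
        if not healthy:
            raise RuntimeError("no healthy targets")
        chosen = random.choice(healthy)  # placeholder for a real selector
        if session_id is not None:
            self.sessions[session_id] = chosen
            self.sessions.move_to_end(session_id)
            while len(self.sessions) > self.max_sessions:
                self.sessions.popitem(last=False)  # drop the oldest session
        return chosen


if __name__ == "__main__":
    router = SessionRouter([Target("east"), Target("west"), Target("south")])
    first = router.select("anything")
    print("pinned to:", first.name)
    first.cooldown()
    print("after cooldown:", router.select("anything").name)
```

Bounding the map keeps memory use fixed for long-running clients, at the cost of occasionally re-selecting a deployment for a session that has been idle long enough to be evicted.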

Runnable Example

#!/usr/bin/env python3
#
# To run this, use:
#   uv run --env-file .env tools/readme_example.py
#
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "azure-switchboard",
# ]
# ///

import asyncio
import os

from azure_switchboard import AzureDeployment, Model, OpenAIDeployment, Switchboard

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY", None)

deployments = []
if azure_openai_endpoint and azure_openai_api_key:
    # create 3 deployments. reusing the endpoint
    # is fine for the purposes of this demo
    for name in ("east", "west", "south"):
        deployments.append(
            AzureDeployment(
                name=name,
                endpoint=azure_openai_endpoint,
                api_key=azure_openai_api_key,
                models=[Model(name="gpt-4o-mini")],
            )
        )

if openai_api_key:
    # we can use openai as a fallback deployment
    # it will pick up the api key from the environment
    deployments.append(OpenAIDeployment())


async def main():
    async with Switchboard(deployments=deployments) as sb:
        print("Basic functionality:")
        await basic_functionality(sb)

        print("Session affinity (should warn):")
        await session_affinity(sb)


async def basic_functionality(switchboard: Switchboard):
    # Make a completion request (non-streaming)
    response = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
    )

    print("completion:", response.choices[0].message.content)

    # Make a streaming completion request
    stream = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
        stream=True,
    )

    print("streaming: ", end="")
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

    print()


async def session_affinity(switchboard: Switchboard):
    session_id = "anything"

    # First message will select a random healthy
    # deployment and associate it with the session_id
    r = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2020?"}],
    )

    d1 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 1:", d1)
    print("response 1:", r.choices[0].message.content)

    # Follow-up requests with the same session_id will route to the same deployment
    r2 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Who won the World Series in 2020?"},
            {"role": "assistant", "content": r.choices[0].message.content},
            {"role": "user", "content": "Who did they beat?"},
        ],
    )

    print("response 2:", r2.choices[0].message.content)

    # Simulate a failure by marking down the deployment
    d1.models["gpt-4o-mini"].cooldown()

    # A new deployment will be selected for this session_id
    r3 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2021?"}],
    )

    d2 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 2:", d2)
    print("response 3:", r3.choices[0].message.content)
    assert d2 != d1


if __name__ == "__main__":
    asyncio.run(main())

Benchmarks

just bench
uv run --env-file .env tools/bench.py -v -r 1000 -d 10 -e 500
Distributing 1000 requests across 10 deployments
Max inflight requests: 1000

Request 500/1000 completed
Utilization Distribution:
0.000 - 0.200 |   0
0.200 - 0.400 |  10 ..............................
0.400 - 0.600 |   0
0.600 - 0.800 |   0
0.800 - 1.000 |   0
Avg utilization: 0.339 (0.332 - 0.349)
Std deviation: 0.006

{
    'bench_0': {'gpt-4o-mini': {'util': 0.361, 'tpm': '10556/30000', 'rpm': '100/300'}},
    'bench_1': {'gpt-4o-mini': {'util': 0.339, 'tpm': '9819/30000', 'rpm': '100/300'}},
    'bench_2': {'gpt-4o-mini': {'util': 0.333, 'tpm': '9405/30000', 'rpm': '97/300'}},
    'bench_3': {'gpt-4o-mini': {'util': 0.349, 'tpm': '10188/30000', 'rpm': '100/300'}},
    'bench_4': {'gpt-4o-mini': {'util': 0.346, 'tpm': '10210/30000', 'rpm': '99/300'}},
    'bench_5': {'gpt-4o-mini': {'util': 0.341, 'tpm': '10024/30000', 'rpm': '99/300'}},
    'bench_6': {'gpt-4o-mini': {'util': 0.343, 'tpm': '10194/30000', 'rpm': '100/300'}},
    'bench_7': {'gpt-4o-mini': {'util': 0.352, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_8': {'gpt-4o-mini': {'util': 0.35, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_9': {'gpt-4o-mini': {'util': 0.365, 'tpm': '10840/30000', 'rpm': '101/300'}}
}

Utilization Distribution:
0.000 - 0.100 |   0
0.100 - 0.200 |   0
0.200 - 0.300 |   0
0.300 - 0.400 |  10 ..............................
0.400 - 0.500 |   0
0.500 - 0.600 |   0
0.600 - 0.700 |   0
0.700 - 0.800 |   0
0.800 - 0.900 |   0
0.900 - 1.000 |   0
Avg utilization: 0.348 (0.333 - 0.365)
Std deviation: 0.009

Distribution overhead: 926.14ms
Average response latency: 5593.77ms
Total latency: 17565.37ms
Requests per second: 1079.75
Overhead per request: 0.93ms

Distribution overhead scales ~linearly with the number of deployments.
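The summary figures in the run above are consistent with both "Overhead per request" and "Requests per second" being derived from the distribution overhead and the request count, rather than from the total latency. This reading is inferred from the printed numbers, not from the benchmark script, and the summarize helper below is hypothetical.

```python
def summarize(requests: int, distribution_overhead_ms: float) -> dict:
    """Derive per-request overhead and dispatch rate from the printed totals."""
    overhead_per_request_ms = distribution_overhead_ms / requests
    requests_per_second = requests / (distribution_overhead_ms / 1000.0)
    return {
        "overhead_per_request_ms": round(overhead_per_request_ms, 2),
        "requests_per_second": round(requests_per_second, 2),
    }


# Figures taken from the 1000-request, 10-deployment run above.
print(summarize(requests=1000, distribution_overhead_ms=926.14))
```

With the values from that run this gives 0.93 ms per request and about 1079.75 requests per second, matching the printed summary. Under this reading, "Requests per second" measures how fast the client can dispatch requests, not end-to-end throughput against Azure OpenAI.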

Configuration Reference

switchboard.Model Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| name | Configured model name, e.g. "gpt-4o" or "gpt-4o-mini" | Required |
| tpm | Configured TPM rate limit | 0 (unlimited) |
| rpm | Configured RPM rate limit | 0 (unlimited) |
| default_cooldown | Default cooldown period in seconds | 10.0 |

switchboard.AzureDeployment Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| name | Unique identifier for the deployment | Required |
| endpoint | Azure OpenAI endpoint URL | Required |
| api_key | Azure OpenAI API key | Required |
| api_version | Azure OpenAI API version | "2024-10-21" |
| timeout | Default timeout in seconds | 600.0 |
| models | List of Models configured for this deployment | Required |

switchboard.Switchboard Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| deployments | List of Deployment config objects | Required |
| selector | Selection algorithm | two_random_choices |
| failover_policy | Policy for handling failed requests | AsyncRetrying(stop=stop_after_attempt(2)) |
| ratelimit_window | Ratelimit window in seconds | 60.0 |
| max_sessions | Maximum number of sessions | 1024 |
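To show how tpm, rpm, and ratelimit_window might fit together, here is a rough sketch of a fixed-window usage counter. WindowedUsage is a hypothetical class written for this example, and the max-of-ratios formula is an assumption: the utilization values printed in the benchmark output differ slightly from it, so the library evidently computes utilization somewhat differently.

```python
import time


class WindowedUsage:
    """Hypothetical per-model usage counter; not the library's implementation."""

    def __init__(self, tpm=0, rpm=0, window=60.0, clock=time.monotonic):
        # Limits are per minute, so scale them to the window length.
        scale = window / 60.0
        self.token_limit = tpm * scale
        self.request_limit = rpm * scale
        self.window = window
        self._clock = clock
        self._window_start = clock()
        self.tokens = 0
        self.requests = 0

    def _maybe_reset(self):
        # Start a fresh window once the current one has elapsed.
        if self._clock() - self._window_start >= self.window:
            self._window_start = self._clock()
            self.tokens = 0
            self.requests = 0

    def record(self, tokens):
        """Count one completed request and the tokens it consumed."""
        self._maybe_reset()
        self.tokens += tokens
        self.requests += 1

    def utilization(self):
        """Return the larger of the token and request ratios (0 = unlimited)."""
        self._maybe_reset()
        token_util = self.tokens / self.token_limit if self.token_limit else 0.0
        request_util = self.requests / self.request_limit if self.request_limit else 0.0
        return max(token_util, request_util)


if __name__ == "__main__":
    usage = WindowedUsage(tpm=30000, rpm=300)
    for _ in range(100):
        usage.record(tokens=50)
    print(round(usage.utilization(), 3))
```

A limit of 0 is treated as unlimited, mirroring the defaults in the Model table, so a model with no configured limits always reports zero utilization in this sketch.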

Development

This project uses uv for package management and the just command runner for task automation. See the justfile for available commands.

git clone https://github.com/arini-ai/azure-switchboard
cd azure-switchboard

just install

Running tests

just test

Release

This library uses CalVer for versioning. On push to master, if tests pass, a package is automatically built, released, and uploaded to PyPI.

Locally, the package can be built with uv:

uv build

OpenTelemetry Integration

The library provides instrumentation for monitoring deployment health and performance metrics:

(azure-switchboard) .venv > just otel-run
uv run --env-file .env opentelemetry-instrument python tools/bench.py -r 5 -d 3
Distributing 5 requests across 3 deployments
Max inflight requests: 1000

Distribution overhead: 10.53ms
Average response latency: 2164.03ms
Total latency: 3869.06ms
Requests per second: 475.03
Overhead per request: 2.11ms
{
    "resource_metrics": [
        {
            "resource": {
                "attributes": {
                    "telemetry.sdk.language": "python",
                    "telemetry.sdk.name": "opentelemetry",
                    "telemetry.sdk.version": "1.31.0",
                    "service.name": "switchboard",
                    "telemetry.auto.version": "0.52b0"
                },
                "schema_url": ""
            },
            "scope_metrics": [
                {
                    "scope": {
                        "name": "azure_switchboard.deployment",
                        "version": "",
                        "schema_url": "",
                        "attributes": null
                    },
                    "metrics": [
                        {
                            "name": "model_utilization",
                            "description": "Current utilization of a model deployment (0-1)",
                            "unit": "percent",
                            "data": {
                                "data_points": [
                                    {
                                        "attributes": {
                                            "model": "gpt-4o-mini"
                                        },
                                        "start_time_unix_nano": null,
                                        "time_unix_nano": 1742461487509982000,
                                        "value": 0.008,
                                        "exemplars": []
...

Contributing

  1. Fork/clone repo
  2. Make changes
  3. Run tests with just test
  4. Lint with just lint
  5. Commit and make a PR

License

MIT
