A multi-backend, prioritization load balancer for OpenAI
OpenAI Priority Load Balancer
Many AI workloads require more than one Azure OpenAI instance to prioritize Provisioned Throughput Units (PTUs) and to insulate themselves from timeouts. Having worked with customers on Azure OpenAI implementations, I have seen a few common, desired configurations:
- Distribution of requests over multiple consumption instances to mitigate throttling.
- Prioritized exhaustion of all tokens in a PTU instance, with a fallback onto multiple consumption instances.
- Tiered prioritization of multiple consumption instances (e.g. use instances first that are geographically closer).
While the OpenAI Python API library respects HTTP 429 and automatically retries after the requested wait period, the library is not set up to support the configurations listed above. The library does, however, allow for the injection of custom httpx clients, which gave rise to this project.
And while there are other Python OpenAI load balancers freely available, I have not seen one yet that addresses the aforementioned scenarios.
OpenAI Priority Load Balancer is injected cleanly into the OpenAI Python API library. The changes between a conventional and a load-balanced Azure OpenAI implementation are few and consist almost entirely of configuring the backends to be used. You can see a side-by-side example in the aoai.py file in this repo.
Please refer to the GitHub repo for detailed test harnesses for the use cases described below.
Disclaimer
This is a pseudo load-balancer.
When executing this code in parallel, there is no way to distribute requests uniformly across all Azure OpenAI instances. Doing so would require a centralized service, cache, etc. to keep track of a common backends list, but that would also imply a locking mechanism for updates, which would immediately inhibit the performance benefits of the load balancer. Without knowledge of any other Python workers, we can only randomize selection of an available backend.
Furthermore, while the load balancer handles retries across available backends, the OpenAI Python API library is not fully insulated from failing on multiple HTTP 429s when all backends are returning HTTP 429s. It is advised to load-test with multiple concurrent Python workers to understand how your specific Azure OpenAI instances, your limits, and your load balancer configuration behave.
Attribution
This project would not have been possible without the incredible work that @andredewes has done with his Smart Load Balancing for OpenAI Endpoints and Azure API Management. If you use Azure API Management in your infrastructure, I highly recommend you consider his policy.
Prerequisites
It helps to have some familiarity with how the OpenAI Python API library works. If you have used it before, then the code in the aoai.py test harness for this package will look very familiar to you. It's also good to have some knowledge of authentication and identities.
Getting Started
Installing the Package
- Add openai_priority_loadbalancer to your requirements.txt file.
- Run pip install -r </path/to/requirements.txt>.
Importing Classes
Either import the synchronous load balancer:
from openai_priority_loadbalancer import LoadBalancer, Backend
Or import the asynchronous load balancer:
from openai_priority_loadbalancer import AsyncLoadBalancer, Backend
Configuring the Backends and Load Balancer
- Define a list of backends according to the Load Balancer Configuration section below.
from typing import List

backends: List[Backend] = [
    Backend("oai-eastus.openai.azure.com", 1),
    Backend("oai-southcentralus.openai.azure.com", 1)
]
- Instantiate the load balancer and inject a new httpx client with the load balancer as the new transport.
from openai import AzureOpenAI, DefaultHttpxClient

lb = LoadBalancer(backends)

client = AzureOpenAI(
    azure_endpoint = f"https://{backends[0].host}",    # Must be seeded, so we use the first host. It will get overwritten by the load balancer.
    azure_ad_token_provider = token_provider,          # Your authentication may vary. Please adjust accordingly.
    api_version = "2024-04-01-preview",
    http_client = DefaultHttpxClient(transport = lb)   # Inject the load balancer as the transport in a new default httpx client
)
Using the Load Balancer
As these are the only changes to the OpenAI Python API library implementation, simply execute your Python code.
Logging
OpenAI Priority Load Balancer uses Python's logging module. The name of the logger is openai-priority-loadbalancer.
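To surface the load balancer's messages, you can attach a handler to that named logger. The snippet below is a minimal sketch using only the standard library. The helper name enable_load_balancer_logging and the format string are my own choices for illustration, and which levels the package actually logs at is not stated here, so adjust the level to taste.

```python
import logging

# Logger name as given in the text above.
LOGGER_NAME = "openai-priority-loadbalancer"


def enable_load_balancer_logging(level: int = logging.INFO) -> logging.Logger:
    """Hypothetical helper: route the load balancer's log records to stderr."""
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(level)
    # Only add a handler once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


lb_logger = enable_load_balancer_logging(logging.INFO)
```

Calling the helper again with a different level only changes the level; it does not add a second handler.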
Distribution of Requests
Across Different Priorities
Requests are made to the highest priority backend that is available. For example:
- Priority 1, when available, will always supersede priority 2.
- Priority 2, when available, will always supersede an unavailable priority 1.
- Priority 3, when available, will always supersede unavailable priorities 1 & 2.
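The rules above can be sketched in a few lines. This is not the package's implementation; SketchBackend and pick_backend are hypothetical names used only to illustrate the rule "lowest priority number among available backends wins, ties are broken randomly".

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchBackend:
    """Hypothetical stand-in for a backend; not the package's Backend class."""
    host: str
    priority: int
    available: bool = True


def pick_backend(backends: List[SketchBackend], rng: random.Random) -> Optional[SketchBackend]:
    """Pick randomly among the available backends with the best (lowest) priority."""
    candidates = [b for b in backends if b.available]
    if not candidates:
        return None  # nothing available; the caller has to back off
    best = min(b.priority for b in candidates)
    return rng.choice([b for b in candidates if b.priority == best])


rng = random.Random(0)
tiers = [
    SketchBackend("oai-eastus.openai.azure.com", 1),
    SketchBackend("oai-southcentralus.openai.azure.com", 2),
    SketchBackend("oai-westus.openai.azure.com", 3),
]

first = pick_backend(tiers, rng)    # priority 1 is available
tiers[0].available = False
second = pick_backend(tiers, rng)   # priority 1 unavailable, priority 2 takes over
tiers[1].available = False
third = pick_backend(tiers, rng)    # priorities 1 and 2 unavailable
```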
Across Multiple Backends of Same Priority
In the single-requestor model, the distribution of attempts over available backends should be fairly uniform for backends of the same priority.
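As a rough illustration of why random selection evens out for a single requestor, the following sketch simulates many picks over two same-priority hosts. It models only the random choice, not throttling or retries, and simulate_picks is a hypothetical helper, not part of the package.

```python
import random
from collections import Counter
from typing import List


def simulate_picks(hosts: List[str], requests: int, seed: int) -> Counter:
    """Count how often each host is chosen when every request picks at random."""
    rng = random.Random(seed)
    return Counter(rng.choice(hosts) for _ in range(requests))


same_priority_hosts = [
    "oai-eastus.openai.azure.com",
    "oai-southcentralus.openai.azure.com",
]
total_requests = 10_000
counts = simulate_picks(same_priority_hosts, total_requests, seed=42)

# Fraction of requests that landed on each host.
shares = {host: counts[host] / total_requests for host in same_priority_hosts}
```

With only a handful of requests, as in a short test run, the split can be noticeably lopsided; the shares approach one half each only as the number of requests grows.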
A uniform distribution across available endpoints should not be expected when running multiple Python workers in parallel. In one example run, each of two terminals executes 20 requests over two Azure OpenAI instances, both set up with the lowest tokens-per-minute setting. Available backends are selected randomly (see the first request in each terminal). No data is shared between the two terminals. Recovery takes place when possible; otherwise, an HTTP 429 is returned to the OpenAI Python API library.
Backoff & Retries
When no backends are available (e.g. all timed out), OpenAI Priority Load Balancer returns the soonest retry in seconds, determined from the retry_after value on each backend.
You may notice a delay in the logs between when the load balancer returns and when the next request is made. In addition to the Retry-After header value, the OpenAI Python library uses a short exponential backoff.
In this log excerpt, we see that all three backends are timing out. As the standard behavior returns an HTTP 429 from a single backend, we do the same here with the load-balanced approach. This allows the OpenAI Python library to handle the HTTP 429 that it believes it received from a singular backend.
The wait periods are 44 seconds (westus), 4 seconds (eastus), and 7 seconds (southcentralus) in this log. Our logic determines that eastus will become available soonest. Therefore, we return a Retry-After header with a value of 4. The OpenAI Python library then adds its exponential backoff (~2 seconds here).
2024-05-11 00:56:32.299477: Request sent to server: https://oai-westus-20240509.openai.azure.com/openai/deployments/gpt-35-turbo-sjk-001/chat/completions?api-version=2024-04-01-preview, Status Code: 429 - FAIL
2024-05-11 00:56:32.299477: Backend oai-westus-20240509.openai.azure.com is throttling. Retry after 44 second(s).
2024-05-11 00:56:32.394350: Request sent to server: https://oai-eastus-20240509.openai.azure.com/openai/deployments/gpt-35-turbo-sjk-001/chat/completions?api-version=2024-04-01-preview, Status Code: 429 - FAIL
2024-05-11 00:56:32.395578: Backend oai-eastus-20240509.openai.azure.com is throttling. Retry after 4 second(s).
2024-05-11 00:56:32.451891: Request sent to server: https://oai-southcentralus-20240509.openai.azure.com/openai/deployments/gpt-35-turbo-sjk-001/chat/completions?api-version=2024-04-01-preview, Status Code: 429 - FAIL
2024-05-11 00:56:32.452883: Backend oai-southcentralus-20240509.openai.azure.com is throttling. Retry after 7 second(s).
2024-05-11 00:56:32.452883: No backends available. Exiting.
2024-05-11 00:56:32.453891: Soonest Retry After: oai-eastus-20240509.openai.azure.com - 4 second(s)
2024-05-11 00:56:38.551672: Backend oai-eastus-20240509.openai.azure.com is no longer throttling.
2024-05-11 00:56:39.851076: Request sent to server: https://oai-eastus-20240509.openai.azure.com/openai/deployments/gpt-35-turbo-sjk-001/chat/completions?api-version=2024-04-01-preview, Status code: 200
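The "soonest retry" decision in the log above boils down to taking the minimum wait across all throttled backends. Here is a minimal sketch with the values from that log; soonest_retry and the plain dictionary are illustrative assumptions, not the package's internal data structures.

```python
from typing import Dict, Tuple


def soonest_retry(retry_after_by_host: Dict[str, int]) -> Tuple[str, int]:
    """Return the host that becomes available first and its wait in seconds."""
    host = min(retry_after_by_host, key=retry_after_by_host.get)
    return host, retry_after_by_host[host]


# Wait periods reported by each throttled backend in the log excerpt.
waits = {
    "oai-westus-20240509.openai.azure.com": 44,
    "oai-eastus-20240509.openai.azure.com": 4,
    "oai-southcentralus-20240509.openai.azure.com": 7,
}

soonest_host, soonest_seconds = soonest_retry(waits)

# The single HTTP 429 handed back to the client carries this header.
retry_after_header = {"Retry-After": str(soonest_seconds)}
```

Note that min raises a ValueError on an empty dictionary, so this sketch assumes at least one backend is configured.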
Load Balancer Configuration
At its core, the Load Balancer configuration requires one or more backend hosts and a numeric priority starting at 1. Please take note that you define a host, not a URL.
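If your configuration stores endpoints as full URLs, you need to reduce them to a bare host before creating a backend. The helper below is a hypothetical convenience built on the standard library's urllib.parse; it is not part of this package.

```python
from urllib.parse import urlsplit


def to_host(endpoint: str) -> str:
    """Return the bare host for an endpoint given either as a URL or as a host."""
    # urlsplit only populates hostname when the input has a network location,
    # so prefix scheme-less input with "//" to have it parsed as one.
    parts = urlsplit(endpoint if "://" in endpoint else f"//{endpoint}")
    if not parts.hostname:
        raise ValueError(f"Cannot determine host from {endpoint!r}")
    return parts.hostname


host_from_url = to_host("https://oai-eastus.openai.azure.com/")
host_from_host = to_host("oai-eastus.openai.azure.com")
```

Both calls yield the same host string, which is the form the backend definition expects.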
I use a total of three Azure OpenAI instances in three regions. These instances are set up with intentionally small tokens-per-minute (tpm) limits to trigger HTTP 429s. The standard approach never changes and uses the same host (first in the backend list), which provides a stable comparison to the load-balanced approach. While the number of requests differs per test below, we issue the same number of requests against the standard and load-balanced approaches.
One Backend
This is logically equivalent to what the standard approach does. This configuration does not provide value over the standard approach.
# Define the backends and their priority
backends = [
    Backend("oai-eastus-xxxxxxxx.openai.azure.com", 1)
]
Two Backends with Same Priority
Load-balancing evenly between Azure OpenAI instances hedges you against being stalled due to a 429 from a single instance.
# Define the backends and their priority
backends = [
    Backend("oai-eastus-xxxxxxxx.openai.azure.com", 1),
    Backend("oai-southcentralus-xxxxxxxx.openai.azure.com", 1)
]
Three Backends with Same Priority
Adding a third backend with the same priority widens the difference to the standard approach further. Here, we need to use 20 requests to incur more HTTP 429s.
# Define the backends and their priority
backends = [
    Backend("oai-eastus-xxxxxxxx.openai.azure.com", 1),
    Backend("oai-southcentralus-xxxxxxxx.openai.azure.com", 1),
    Backend("oai-westus-xxxxxxxx.openai.azure.com", 1)
]
Three Backends with Two Different Priorities
The most common reason for this approach may well be the prioritization of Provisioned Throughput Units (PTUs). A PTU is reserved capacity over a period of time that is billed at that reservation and is not as flexible as consumption instances. Aside from guaranteed capacity, latency is also much more stable. Naturally, this is an instance that you would want to prioritize over all others, while still allowing yourself fallbacks if you burst over what the PTU provides.
# Define the backends and their priority
backends = [
    Backend("oai-eastus-xxxxxxxx.openai.azure.com", 1),
    Backend("oai-southcentralus-xxxxxxxx.openai.azure.com", 2),
    Backend("oai-westus-xxxxxxxx.openai.azure.com", 2)
]
Three Backends with Three Different Priorities
An example of this setup may be that most of your assets reside in one region (e.g. East US). It stands to reason that you want to use the Azure OpenAI instance in that region. To hedge yourself against HTTP 429s, you decide to add a second region that's geographically close (e.g. East US 2) as well as a third (e.g. West US).
# Define the backends and their priority
backends = [
    Backend("oai-eastus-xxxxxxxx.openai.azure.com", 1),
    Backend("oai-southcentralus-xxxxxxxx.openai.azure.com", 2),
    Backend("oai-westus-xxxxxxxx.openai.azure.com", 3)
]