fastapi-balancer
Auto throughput probing and admission control for FastAPI endpoints
A Python library for auto-probing the throughput ceiling of FastAPI endpoints and enforcing admission control to prevent overload. It measures how many concurrent requests your endpoint can safely handle, then enforces that limit at runtime by queuing excess requests rather than dropping them.
Table of Contents
- Overview
- How It Works
- Installation
- Quick Start
- Usage
- BenchBalancer
- Built-in Endpoints
- Parameters
- License
Overview
When running a FastAPI app with compute-heavy endpoints (LLM inference, scoring models, image processing), too many concurrent requests cause latency to spike, exhaust memory, and leave the service unresponsive. Traditional solutions require manual tuning or external infrastructure.
fastapi-balancer solves this by:
- Auto-probing — at startup, it sends increasing levels of concurrent requests to your endpoint and finds the maximum concurrency at which it remains stable (within your error rate and latency thresholds).
- Admission control — a middleware intercepts every incoming request to watched endpoints, admits up to the measured capacity, and queues the rest. Requests only fail with a 504 if they wait longer than queue_timeout seconds — no requests are silently dropped.
- Cross-worker coordination — when running multiple uvicorn workers, a Redis-backed counter ensures the global in-flight count stays within capacity across all processes.
- Dashboard — a built-in web UI (configurable path, default /balancer/ui) shows live capacity, active requests, queue depth, and a 60-second time-series chart per endpoint.
How It Works
Throughput Probing
The prober sends batches of concurrent requests to each watched endpoint using step-up concurrency levels: 1, 2, 4, 8, 16, 32, 64, 128, ... At each level it measures:
- Error rate (non-2xx responses)
- p99 latency
- RPS
Probing stops when either error_threshold or latency_threshold_ms is exceeded. A binary search then refines the result between the last passing level and the first failing level. The result (max_concurrency) is saved to storage and used as the admission cap.
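The refinement step can be sketched as a standard binary search between the last passing and first failing concurrency levels. This is a minimal illustration only — refine_capacity and passes are hypothetical names, not part of the library's API; passes(c) stands in for running a probe batch at concurrency c and checking it against the thresholds.

```python
def refine_capacity(passes, last_pass: int, first_fail: int) -> int:
    """Binary-search the largest concurrency in (last_pass, first_fail)
    that still satisfies the error-rate and latency thresholds.

    `passes(c)` is assumed to probe at concurrency `c` and return True
    when the endpoint stays stable at that level.
    """
    lo, hi = last_pass, first_fail
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid   # mid is stable; search higher
        else:
            hi = mid   # mid overloads; search lower
    return lo          # largest passing concurrency


# For instance, if level 16 passed and 32 failed, and the true ceiling
# is 23, the search converges to 23 in a few extra probe batches.
```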
Admission Control
Every request to a watched endpoint goes through the middleware:
- Check backend health — return 503 if unhealthy.
- Attempt to acquire a slot (increment active counter).
- If slots are full — place request in a FIFO queue, wait up to queue_timeout seconds.
- If queue wait expires — return 504 Gateway Timeout.
- Otherwise — pass request to the actual handler, release slot on completion.
Request Flow
Incoming request
|
v
BalancerMiddleware
|-- path not watched? --> pass through unchanged
|-- backend unhealthy? --> 503
|-- slot available? --> admit, call handler, release slot
|-- slots full --> FIFO queue, wait for release
|-- queue timeout --> 504
Installation
From PyPI
uv add fastapi-balancer
The published wheel includes the pre-built dashboard. No Node.js or pnpm required.
From source
Clone the repository and build the wheel. The build step automatically runs pnpm install && pnpm build inside dashboard/ and bundles the output into the package — pnpm must be available on your PATH.
git clone https://github.com/hienhayho/fastapi_balancer.git
cd fastapi_balancer
uv build
uv add dist/fastapi_balancer-*.whl
To skip the dashboard (no Node.js needed), remove or rename the dashboard/ directory before building — the build hook silently skips the frontend step when the directory is absent.
Local development (editable install)
git clone https://github.com/hienhayho/fastapi_balancer.git
cd fastapi_balancer
uv sync --extra dev
In editable mode the app serves the dashboard directly from dashboard/dist/ — run the frontend build separately when you want the UI:
cd dashboard
pnpm install
pnpm build
Redis is a required dependency. Ensure a Redis server is accessible when using multi-worker deployments.
Quick Start
from fastapi import FastAPI
from fastapi_balancer import Balancer, BalancerConfig, StorageConfig, StorageType
from fastapi_balancer.models import EndpointProbeConfig
app = FastAPI()
# ... include your routers ...
balancer = Balancer(
config=BalancerConfig(
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379"),
probe_on_startup=True,
queue_timeout=30.0,
endpoints={
"/predict": EndpointProbeConfig(
method="POST",
headers={"Authorization": "Bearer your-token"},
body={"input": "sample"},
)
},
)
)
balancer.wrap(app)
On startup the balancer probes /predict, stores its capacity, and begins enforcing the limit on every incoming request.
Usage
Inline Config
from fastapi_balancer import Balancer, BalancerConfig, RoutingStrategy, StorageConfig, StorageType, UIConfig
from fastapi_balancer.models import EndpointProbeConfig
balancer = Balancer(
config=BalancerConfig(
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379"),
routing_strategy=RoutingStrategy.ROUND_ROBIN,
probe_on_startup=True,
queue_timeout=60.0,
latency_threshold_ms=10000,
ui=UIConfig(username="admin", password="secret"),
endpoints={
"/ai_score": EndpointProbeConfig(
method="POST",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer your-token",
},
body={"inputs": [...], "language": "en"},
)
},
)
)
balancer.wrap(app)
YAML Config
balancer = Balancer(config="balancer.yml")
balancer.wrap(app)
Example balancer.yml:
storage:
type: redis
url: redis://localhost:6379
routing_strategy: round-robin
health_endpoint: /health
health_check_interval: 10
queue_timeout: 30
probe_on_startup: true
force_reprobe: false
error_threshold: 0.05
latency_threshold_ms: 2000
ui:
enable: true
path: /balancer/ui
username: admin
password: secret
endpoints:
/predict:
method: POST
headers:
Content-Type: application/json
Authorization: Bearer your-token
body:
input: "sample input"
/embed:
method: POST
headers:
Content-Type: application/json
body:
texts: ["hello"]
See balancer.yml.example for the full annotated template.
Manual Capacity
If you already know the safe concurrency for an endpoint, set capacity directly to skip probing entirely. This works even when probe_on_startup=True.
endpoints={
"/predict": EndpointProbeConfig(
method="POST",
capacity=10, # use this, skip probing
)
}
Or in YAML:
endpoints:
/predict:
method: POST
capacity: 10
Per-Endpoint Queue Timeout
Override the global queue_timeout for a specific endpoint using EndpointProbeConfig.queue_timeout. Useful when different endpoints have different client patience levels.
endpoints={
"/predict": EndpointProbeConfig(
method="POST",
queue_timeout=120.0, # long-running inference — wait up to 2 minutes
),
"/health": EndpointProbeConfig(
method="GET",
queue_timeout=5.0, # health checks should fail fast
),
}
Or in YAML:
queue_timeout: 30 # global default
endpoints:
/predict:
method: POST
queue_timeout: 120
/health:
method: GET
queue_timeout: 5
Multi-Worker Deployment
When running uvicorn with multiple workers, each worker is an independent process. In-memory storage cannot be shared across processes. Use Redis so all workers share a single global counter:
uvicorn main:app --workers 4
BalancerConfig(
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379")
)
Without Redis, each worker probes independently and tracks its own active count — leading to the total in-flight count being workers x capacity instead of capacity.
With Redis:
- The first worker to start probes the endpoint and writes the result.
- Subsequent workers see the result already in Redis and skip probing.
- All workers share a single atomic counter for active requests.
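The shared-counter idea can be sketched with atomic INCR/DECR, which redis-py executes server-side. This is a simplified illustration — the key name and the increment-then-rollback pattern are assumptions, not the library's actual schema.

```python
def try_acquire(r, key: str, capacity: int) -> bool:
    """Claim one slot of the shared in-flight counter, or back off.

    `r` is any client with atomic incr/decr (e.g. a redis-py Redis
    instance). Because INCR executes atomically on the Redis server,
    two workers can never both keep the last slot: whichever worker
    pushes the counter past `capacity` immediately rolls back.
    """
    if r.incr(key) <= capacity:
        return True
    r.decr(key)  # over capacity: undo our increment and report full
    return False

def release(r, key: str) -> None:
    r.decr(key)  # called when the admitted request completes
```

With redis-py this would be called as try_acquire(redis.Redis.from_url("redis://localhost:6379"), "active:/predict", capacity) around each handled request; the "active:/predict" key name is illustrative.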
If you need to force all workers to re-probe (e.g. after a model change), set force_reprobe=True for one restart, then remove it.
Dashboard UI
The dashboard is mounted at ui.path (default /balancer/ui) and auto-detects the API base URL from window.location.origin.
To protect it with a password:
UIConfig(username="admin", password="secret")
To change the mount path:
UIConfig(path="/dashboard", username="admin", password="secret")
To disable the dashboard entirely:
UIConfig(enable=False)
The browser will show its native Basic Auth popup when credentials are configured. Without username/password, the UI is open.
The dashboard polls /balancer/stats every 2 seconds and shows:
- Per-endpoint capacity, active requests, available slots, utilization percentage
- Health status (green / yellow at 80% / red at 100%)
- Queue depth badge when active requests exceed capacity
- 60-second time-series chart of active requests per endpoint
BenchBalancer
BenchBalancer is a standalone tool for probing a live API from outside — without wrapping a FastAPI app. Run it as a script before deploying, then use the generated YAML in production with probe_on_startup=False.
import asyncio
from fastapi_balancer import BenchBalancer, StorageConfig, StorageType, UIConfig
from fastapi_balancer.models import EndpointProbeConfig
asyncio.run(
BenchBalancer(
base_url="http://localhost:8005",
endpoints={
"/ai_score": EndpointProbeConfig(
method="POST",
headers={"Authorization": "Bearer your-token"},
body={"inputs": [...], "language": "en"},
)
},
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379"),
latency_threshold_ms=80000,
error_threshold=0.05,
queue_timeout=90.0,
ui=UIConfig(username="admin", password="secret"),
).run("balancer.yml")
)
run() probes each endpoint against the live server, measures max_concurrency, writes the result into the YAML as capacity, then saves the file. The output YAML is ready to be passed directly to Balancer(config="balancer.yml").
To export config without probing (e.g. just serialize existing settings):
bench = BenchBalancer(base_url="http://localhost:8005", endpoints={...})
bench.to_yaml("balancer.yml")
Built-in Endpoints
These endpoints are automatically registered on the wrapped app:
| Endpoint | Method | Description |
|---|---|---|
| /balancer/stats | GET | JSON with capacity, active requests, available slots per endpoint |
| <ui.path> | GET | Dashboard web UI (default /balancer/ui) |
/balancer/stats response shape
{
"endpoints": {
"/predict": {
"capacity": 50,
"active_requests": 12,
"available_slots": 38
}
}
}
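The same JSON can be consumed programmatically, e.g. for external monitoring or alerting. A minimal stdlib client, sketched with illustrative helper names (fetch_stats and utilization are not part of the library):

```python
import json
import urllib.request

def fetch_stats(base_url: str) -> dict:
    """GET the per-endpoint stats JSON from /balancer/stats."""
    with urllib.request.urlopen(f"{base_url}/balancer/stats") as resp:
        return json.load(resp)

def utilization(stats: dict, endpoint: str) -> float:
    """Fraction of measured capacity currently in use (0.0-1.0)."""
    ep = stats["endpoints"][endpoint]
    return ep["active_requests"] / ep["capacity"]

# Example: alert when /predict runs above 80% of its probed capacity.
# stats = fetch_stats("http://localhost:8000")
# if utilization(stats, "/predict") > 0.8:
#     print("warning: /predict nearing capacity")
```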
Parameters
For a full reference of all configuration parameters, see PARAMS.md.
License
MIT