fastapi-balancer
Auto throughput probing and admission control for FastAPI endpoints
A Python library for auto-probing the throughput ceiling of FastAPI endpoints and enforcing admission control to prevent overload. It measures how many concurrent requests your endpoint can safely handle, then enforces that limit at runtime by queuing excess requests rather than dropping them.
Table of Contents
- Overview
- How It Works
- Installation
- Quick Start
- Usage
- BenchBalancer
- Built-in Endpoints
- Parameters
- License
Overview
When running a FastAPI app with compute-heavy endpoints (LLM inference, scoring models, image processing), too many concurrent requests cause latency to spike, exhaust memory, and leave the service unresponsive. Traditional solutions require manual tuning or external infrastructure.
fastapi-balancer solves this by:
- Auto-probing — at startup, it sends increasing levels of concurrent requests to your endpoint and finds the maximum concurrency at which it remains stable (within your error rate and latency thresholds).
- Admission control — a middleware intercepts every incoming request to watched endpoints, admits up to the measured capacity, and queues the rest. Requests only fail with a 504 if they wait longer than queue_timeout seconds — no requests are silently dropped.
- Cross-worker coordination — when running multiple uvicorn workers, a Redis-backed counter ensures the global in-flight count stays within capacity across all processes.
- Dashboard — a built-in web UI (configurable path, default /balancer/ui) shows live capacity, active requests, queue depth, and a 60-second time-series chart per endpoint.
How It Works
Throughput Probing
The prober sends batches of concurrent requests to each watched endpoint using step-up concurrency levels: 1, 2, 4, 8, 16, 32, 64, 128, ... At each level it measures:
- Error rate (non-2xx responses)
- p99 latency
- RPS
Probing stops when either error_threshold or latency_threshold_ms is exceeded. A binary search then refines the result between the last passing level and the first failing level. The result (max_concurrency) is saved to storage and used as the admission cap.
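The refinement step can be sketched as a standard binary search between the last passing and first failing concurrency levels. This is a minimal illustration only — refine_capacity and passes are hypothetical names, not part of the library's API; passes(c) stands in for running a probe batch at concurrency c and checking it against the thresholds.

```python
def refine_capacity(passes, last_pass: int, first_fail: int) -> int:
    """Binary-search the largest concurrency in (last_pass, first_fail)
    that still satisfies the error-rate and latency thresholds.

    `passes(c)` is assumed to probe at concurrency `c` and return True
    when the endpoint stays stable at that level.
    """
    lo, hi = last_pass, first_fail
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid   # mid is stable; search higher
        else:
            hi = mid   # mid overloads; search lower
    return lo          # largest passing concurrency


# For instance, if level 16 passed and 32 failed, and the true ceiling
# is 23, the search converges to 23 in a few extra probe batches.
```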
Admission Control
Every request to a watched endpoint goes through the middleware:
- Check backend health — return 503 if unhealthy.
- Attempt to acquire a slot (increment active counter).
- If slots are full — place request in a FIFO queue, wait up to queue_timeout seconds.
- If queue wait expires — return 504 Gateway Timeout.
- Otherwise — pass request to the actual handler, release slot on completion.
Request Flow
Incoming request
|
v
BalancerMiddleware
|-- path not watched? --> pass through unchanged
|-- backend unhealthy? --> 503
|-- slot available? --> admit, call handler, release slot
|-- slots full --> FIFO queue, wait for release
|-- queue timeout --> 504
Installation
From PyPI
uv add fastapi-balancer
The published wheel includes the pre-built dashboard. No Node.js or pnpm required.
From source
Clone the repository and build the wheel. The build step automatically runs pnpm install && pnpm build inside dashboard/ and bundles the output into the package — pnpm must be available on your PATH.
git clone https://github.com/hienhayho/fastapi_balancer.git
cd fastapi_balancer
uv build
uv add dist/fastapi_balancer-*.whl
To skip the dashboard (no Node.js needed), remove or rename the dashboard/ directory before building — the build hook silently skips the frontend step when the directory is absent.
Local development (editable install)
git clone https://github.com/hienhayho/fastapi_balancer.git
cd fastapi_balancer
uv sync --extra dev
In editable mode the app serves the dashboard directly from dashboard/dist/ — run the frontend build separately when you want the UI:
cd dashboard
pnpm install
pnpm build
Redis is a required dependency. Ensure a Redis server is accessible when using multi-worker deployments.
Quick Start
from fastapi import FastAPI
from fastapi_balancer import Balancer, BalancerConfig, StorageConfig, StorageType
from fastapi_balancer.models import EndpointProbeConfig
app = FastAPI()
# ... include your routers ...
balancer = Balancer(
config=BalancerConfig(
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379"),
probe_on_startup=True,
queue_timeout=30.0,
endpoints={
"/predict": EndpointProbeConfig(
method="POST",
headers={"Authorization": "Bearer your-token"},
body={"input": "sample"},
)
},
)
)
balancer.wrap(app)
On startup the balancer probes /predict, stores its capacity, and begins enforcing the limit on every incoming request.
Usage
Inline Config
from fastapi_balancer import Balancer, BalancerConfig, RoutingStrategy, StorageConfig, StorageType, UIConfig
from fastapi_balancer.models import EndpointProbeConfig
balancer = Balancer(
config=BalancerConfig(
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379"),
routing_strategy=RoutingStrategy.ROUND_ROBIN,
probe_on_startup=True,
queue_timeout=60.0,
latency_threshold_ms=10000,
ui=UIConfig(username="admin", password="secret"),
endpoints={
"/ai_score": EndpointProbeConfig(
method="POST",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer your-token",
},
body={"inputs": [...], "language": "en"},
)
},
)
)
balancer.wrap(app)
YAML Config
balancer = Balancer(config="balancer.yml")
balancer.wrap(app)
Example balancer.yml:
storage:
type: redis
url: redis://localhost:6379
routing_strategy: round-robin
health_endpoint: /health
health_check_interval: 10
queue_timeout: 30
probe_on_startup: true
force_reprobe: false
error_threshold: 0.05
latency_threshold_ms: 2000
ui:
enable: true
path: /balancer/ui
username: admin
password: secret
endpoints:
/predict:
method: POST
headers:
Content-Type: application/json
Authorization: Bearer your-token
body:
input: "sample input"
/embed:
method: POST
headers:
Content-Type: application/json
body:
texts: ["hello"]
See balancer.yml.example for the full annotated template.
Manual Capacity
If you already know the safe concurrency for an endpoint, set capacity directly to skip probing entirely. This works even when probe_on_startup=True.
endpoints={
"/predict": EndpointProbeConfig(
method="POST",
capacity=10, # use this, skip probing
)
}
Or in YAML:
endpoints:
/predict:
method: POST
capacity: 10
Per-Endpoint Queue Timeout
Override the global queue_timeout for a specific endpoint using EndpointProbeConfig.queue_timeout. Useful when different endpoints have different client patience levels.
endpoints={
"/predict": EndpointProbeConfig(
method="POST",
queue_timeout=120.0, # long-running inference — wait up to 2 minutes
),
"/health": EndpointProbeConfig(
method="GET",
queue_timeout=5.0, # health checks should fail fast
),
}
Or in YAML:
queue_timeout: 30 # global default
endpoints:
/predict:
method: POST
queue_timeout: 120
/health:
method: GET
queue_timeout: 5
Multi-Worker Deployment
When running uvicorn with multiple workers, each worker is an independent process. In-memory storage cannot be shared across processes. Use Redis so all workers share a single global counter:
uvicorn main:app --workers 4
BalancerConfig(
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379")
)
Without Redis, each worker probes independently and tracks its own active count — leading to the total in-flight count being workers x capacity instead of capacity.
With Redis:
- The first worker to start probes the endpoint and writes the result.
- Subsequent workers see the result already in Redis and skip probing.
- All workers share a single atomic counter for active requests.
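The shared-counter idea can be sketched with atomic INCR/DECR, which redis-py executes server-side. This is a simplified illustration — the key name and the increment-then-rollback pattern are assumptions, not the library's actual schema.

```python
def try_acquire(r, key: str, capacity: int) -> bool:
    """Claim one slot of the shared in-flight counter, or back off.

    `r` is any client with atomic incr/decr (e.g. a redis-py Redis
    instance). Because INCR executes atomically on the Redis server,
    two workers can never both keep the last slot: whichever worker
    pushes the counter past `capacity` immediately rolls back.
    """
    if r.incr(key) <= capacity:
        return True
    r.decr(key)  # over capacity: undo our increment and report full
    return False

def release(r, key: str) -> None:
    r.decr(key)  # called when the admitted request completes
```

With redis-py this would be called as try_acquire(redis.Redis.from_url("redis://localhost:6379"), "active:/predict", capacity) around each handled request; the "active:/predict" key name is illustrative.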
If you need to force all workers to re-probe (e.g. after a model change), set force_reprobe=True for one restart, then remove it.
Dashboard UI
The dashboard is mounted at ui.path (default /balancer/ui) and auto-detects the API base URL from window.location.origin.
To protect it with a password:
UIConfig(username="admin", password="secret")
To change the mount path:
UIConfig(path="/dashboard", username="admin", password="secret")
To disable the dashboard entirely:
UIConfig(enable=False)
The browser will show its native Basic Auth popup when credentials are configured. Without username/password, the UI is open.
The dashboard polls /balancer/stats every 2 seconds and shows:
- Per-endpoint capacity, active requests, available slots, utilization percentage
- Health status (green / yellow at 80% / red at 100%)
- Queue depth badge when active requests exceed capacity
- 60-second time-series chart of active requests per endpoint
BenchBalancer
BenchBalancer is a standalone tool for probing a live API from outside — without wrapping a FastAPI app. Run it as a script before deploying, then use the generated YAML in production with probe_on_startup=False.
import asyncio
from fastapi_balancer import BenchBalancer, StorageConfig, StorageType, UIConfig
from fastapi_balancer.models import EndpointProbeConfig
asyncio.run(
BenchBalancer(
base_url="http://localhost:8005",
endpoints={
"/ai_score": EndpointProbeConfig(
method="POST",
headers={"Authorization": "Bearer your-token"},
body={"inputs": [...], "language": "en"},
)
},
storage=StorageConfig(type=StorageType.REDIS, url="redis://localhost:6379"),
latency_threshold_ms=80000,
error_threshold=0.05,
queue_timeout=90.0,
ui=UIConfig(username="admin", password="secret"),
).run("balancer.yml")
)
run() probes each endpoint against the live server, measures max_concurrency, writes the result into the YAML as capacity, then saves the file. The output YAML is ready to be passed directly to Balancer(config="balancer.yml").
To export config without probing (e.g. just serialize existing settings):
bench = BenchBalancer(base_url="http://localhost:8005", endpoints={...})
bench.to_yaml("balancer.yml")
Built-in Endpoints
These endpoints are automatically registered on the wrapped app:
| Endpoint | Method | Description |
|---|---|---|
| /balancer/stats | GET | JSON with capacity, active requests, available slots per endpoint |
| <ui.path> | GET | Dashboard web UI (default /balancer/ui) |
/balancer/stats response shape
{
"endpoints": {
"/predict": {
"capacity": 50,
"active_requests": 12,
"available_slots": 38
}
}
}
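The same JSON can be consumed programmatically, e.g. for external monitoring or alerting. A minimal stdlib client, sketched with illustrative helper names (fetch_stats and utilization are not part of the library):

```python
import json
import urllib.request

def fetch_stats(base_url: str) -> dict:
    """GET the per-endpoint stats JSON from /balancer/stats."""
    with urllib.request.urlopen(f"{base_url}/balancer/stats") as resp:
        return json.load(resp)

def utilization(stats: dict, endpoint: str) -> float:
    """Fraction of measured capacity currently in use (0.0-1.0)."""
    ep = stats["endpoints"][endpoint]
    return ep["active_requests"] / ep["capacity"]

# Example: alert when /predict runs above 80% of its probed capacity.
# stats = fetch_stats("http://localhost:8000")
# if utilization(stats, "/predict") > 0.8:
#     print("warning: /predict nearing capacity")
```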
Parameters
For a full reference of all configuration parameters, see PARAMS.md.
License
MIT