asgi-runway
Requests-in-flight metrics for ASGI/FastAPI — the right signal for pod autoscaling.
Why not CPU / Memory / RPS?
| Signal | Problem |
|---|---|
| CPU / Memory | Reactive. The pod is already overloaded before the metric crosses the threshold. |
| Requests Per Second (RPS) | Misleading. 10 req/s @ 50 ms latency = 0.5 in-flight. 10 req/s @ 5 s latency = 50 in-flight. Same RPS, wildly different load. |
| Requests In Flight ✓ | Directly measures backlog. Based on Little's Law (L = λW): combines throughput and latency into one number. Scale up when this exceeds your pod's target concurrency. |
Installation
```bash
pip install asgi-runway
```
Quick start
```python
from fastapi import FastAPI
from asgi_runway import RunwayMiddleware, metrics_router

app = FastAPI()
app.add_middleware(RunwayMiddleware)
app.include_router(metrics_router)  # exposes GET /metrics
```
Or use the one-liner:
```python
from asgi_runway import setup

setup(app)
```
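Once the server is running, a quick scrape confirms the wiring; the gauge value shown in the comment is illustrative:

```bash
curl -s localhost:8000/metrics | grep runway_requests_in_flight
# e.g. runway_requests_in_flight 3.0   (Prometheus text exposition format)
```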
Metrics exposed
| Metric | Type | Description |
|---|---|---|
| `runway_requests_in_flight` | Gauge | Primary autoscaling signal. Requests currently being processed. |
| `runway_requests_in_flight_by_route` | Gauge (labelled) | Per-route-group breakdown (opt-in). |
| `runway_requests_total` | Counter | Total requests by method + status. |
| `runway_request_duration_seconds` | Histogram | Latency by method. |
Per-route granularity
```python
from asgi_runway import setup

setup(
    app,
    route_groups=[
        (r"^/api/infer", "inference"),  # heavy GPU work
        (r"^/api/embed", "embedding"),  # lighter work
    ],
)
```
This populates `runway_requests_in_flight_by_route{route="inference"}` so you can scale inference and embedding deployments independently.
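A per-pod PromQL query for one group might then look like this (the `job` label value is a placeholder for however your scrape config names the inference deployment):

```
sum(runway_requests_in_flight_by_route{route="inference"}) / count(up{job="inference-api"})
```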
Excluding paths
Health checks and the metrics endpoint itself are excluded by default (`/metrics`, `/healthz`, `/health`). Override with:

```python
app.add_middleware(RunwayMiddleware, exclude_paths=["/ping", "/metrics"])
```
Finding your autoscaling threshold
The threshold is the maximum number of in-flight requests a single pod can handle before latency degrades. Setting it too high means requests pile up before scaling kicks in; too low means you over-provision.
The formula (Little's Law)
```
threshold = target_RPS_per_pod × target_p95_latency_in_seconds
```

Example: you want each pod to serve 50 req/s with p95 < 200 ms:

```
threshold = 50 × 0.2 = 10
```
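The arithmetic is trivial to script if you want it next to your capacity planning notes (`autoscale_threshold` is just an illustrative name for the formula above):

```python
def autoscale_threshold(target_rps_per_pod: float, target_p95_seconds: float) -> float:
    """Little's Law (L = lambda * W): in-flight requests = throughput * latency."""
    return target_rps_per_pod * target_p95_seconds

print(autoscale_threshold(50, 0.2))  # 10.0 -> scale out past 10 in-flight per pod
```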
Finding it empirically with the sweep tool
The repo includes a sweep tool that steps through increasing concurrency levels, measures latency at each, and tells you where your server saturates:
pip install "asgi-runway[dev]"
# For async I/O workloads (DB queries, external API calls)
# Won't saturate — prints the Little's Law formula for your numbers
python -m examples.load_test --mode sweep
# For CPU-bound workloads (sync routes, heavy computation)
# Will saturate — prints the recommended threshold directly
python -m examples.load_test --mode sweep --workload cpu --duration-ms 200
Example output for a CPU-bound endpoint:
```text
conc    p50      p95      p99      throughput    status
────────────────────────────────────────────────────────
   1    0.152s   0.152s   0.152s     6.6 req/s   ✓ ok
   2    0.167s   0.215s   0.215s     9.3 req/s   ⚠ degrading
   4    0.271s   0.340s   0.340s    11.8 req/s   ✗ saturated
   8    0.358s   0.584s   0.584s    13.7 req/s   ⚠ degrading
  16    0.901s   1.184s   1.184s    13.5 req/s   ⚠ degrading

Saturation point                  : ~4 concurrent requests
Recommended autoscaling threshold : 3 (75% of saturation)
```
Saturation is detected when p95 crosses 2× its baseline. The recommended threshold is 75% of the saturation point — this gives pods time to scale up before they hit the wall (new pods take 30–60s to start).
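That rule is easy to restate in code; this is an illustrative sketch of the logic just described, not the sweep tool's actual implementation:

```python
def recommended_threshold(levels: list[tuple[int, float]]) -> int | None:
    """levels: (concurrency, p95_seconds) pairs, ordered by concurrency."""
    baseline_p95 = levels[0][1]
    for concurrency, p95 in levels:
        if p95 > 2 * baseline_p95:               # saturation: p95 crossed 2x baseline
            return max(1, int(concurrency * 0.75))  # back off to 75% of saturation
    return None  # never saturated: fall back to the Little's Law formula

# Data from the example run above -> saturates at 4, recommends 3.
print(recommended_threshold([(1, 0.152), (2, 0.215), (4, 0.340), (8, 0.584), (16, 1.184)]))
```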
By workload type
| Workload | How to find threshold |
|---|---|
| Async I/O (DB, HTTP calls) | `target_RPS × target_p95_seconds`. Run the sweep to confirm no saturation. |
| CPU-bound (sync routes) | Run the sweep with `--workload cpu`. Roughly equals the thread pool size (`min(32, cpu_count + 4)`). |
| ML inference (GPU) | Usually equals your batch size × pipeline depth. Measure with the sweep against your real model endpoint. |
The right KEDA query
Don't use raw `sum(runway_requests_in_flight)` — that scales based on total load, not per-pod load. Use the average:

```
sum(runway_requests_in_flight) / count(up{job="your-app"})
```

This means: "scale when the average pod is handling more than N requests", which is what you actually want.
Kubernetes autoscaling
KEDA (recommended)
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-api-scaledobject
spec:
  scaleTargetRef:
    name: my-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: runway_requests_in_flight
        # Scale when the average pod exceeds 10 in-flight requests.
        # Use the per-pod average, not raw sum — see "Finding your threshold" above.
        query: sum(runway_requests_in_flight) / count(up{job="my-api"})
        threshold: "10"
```
Kubernetes HPA with custom metrics
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
    - type: External
      external:
        metric:
          name: runway_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"  # target 10 in-flight per pod
```
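External metrics also require a metrics adapter in the cluster. With prometheus-adapter, a rule along these lines exposes the series (an illustrative sketch; check your adapter version's config reference before use):

```yaml
# prometheus-adapter rule sketch. A plain sum is enough here because the
# HPA's AverageValue target divides the metric by the replica count itself.
externalRules:
  - seriesQuery: 'runway_requests_in_flight'
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>})'
```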
Decoupling the metrics server
When the application is overloaded, its event loop may be saturated, causing
/metrics scrape requests to time out precisely when you need them most —
right before the autoscaler would fire.
The solution is to serve metrics from a server that does not share the application's event loop. asgi-runway offers two options depending on your deployment.
Option A — Embedded metrics thread (plain Docker / EC2 / single container)
Pass `metrics_port` to `setup()`. A `ThreadingHTTPServer` starts in a background daemon thread — no uvicorn, no asyncio, fully independent:

```python
from asgi_runway import setup

setup(app, metrics_port=9091, include_metrics_route=False)
```

- Prometheus scrapes port `9091`; app traffic goes to port `8000`.
- The metrics thread is isolated from the event loop, so it cannot be blocked by in-flight application requests.
- Works for both single-process uvicorn and multiprocess (gunicorn + uvicorn workers); no shared directory is required for single-process.
```text
┌─────────────────────────────────────────────────┐
│ Single container                                │
│                                                 │
│   ┌──────────────────────┐                      │
│   │ uvicorn (port 8000)  │ ← app traffic        │
│   │ asyncio event loop   │                      │
│   └──────────────────────┘                      │
│                                                 │
│   ┌──────────────────────┐                      │
│   │ metrics thread       │ ← Prometheus scrapes │
│   │ (port 9091)          │   this port          │
│   │ ThreadingHTTPServer  │                      │
│   └──────────────────────┘                      │
└─────────────────────────────────────────────────┘
```
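With this layout, Prometheus just needs to target the metrics port. A minimal scrape config sketch (host and job name are placeholders):

```yaml
scrape_configs:
  - job_name: my-api
    static_configs:
      - targets: ["my-api-host:9091"]  # the metrics thread, not the app port
```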
Option B — Sidecar process (Docker Compose / Kubernetes / ECS)
Run the exporter as a separate container alongside the app. Both containers share `PROMETHEUS_MULTIPROC_DIR` as a mounted volume. The exporter reads the metric files and serves them on its own port — no code in the app server.

```bash
# Requires PROMETHEUS_MULTIPROC_DIR to be set and shared
python -m asgi_runway.exporter --port 9091
```
```text
┌────────────────────────────────────────────────────────────┐
│ Pod / task                                                 │
│                                                            │
│  ┌─────────────────────┐    ┌──────────────────────────┐   │
│  │ uvicorn (port 8000) │    │ runway-exporter (9091)   │   │
│  │ RunwayMiddleware    │    │ python -m                │   │
│  │ writes metric files │    │   asgi_runway.exporter   │   │
│  │ to shared volume ───┼────┼──► reads metric files    │   │
│  └─────────────────────┘    └──────────────────────────┘   │
│        │                            │                      │
│    app traffic               Prometheus scrapes            │
└────────────────────────────────────────────────────────────┘
```
Docker Compose:
version: "3"
services:
app:
image: my-api
ports:
- "8000:8000"
environment:
PROMETHEUS_MULTIPROC_DIR: /tmp/prom
volumes:
- prom_data:/tmp/prom
command: uvicorn app:app --host 0.0.0.0 --port 8000
runway-exporter:
image: my-api # same image, different entrypoint
ports:
- "9091:9091"
environment:
PROMETHEUS_MULTIPROC_DIR: /tmp/prom
volumes:
- prom_data:/tmp/prom
command: python -m asgi_runway.exporter --port 9091
volumes:
prom_data:
Kubernetes sidecar container:
```yaml
containers:
  - name: app
    image: my-api
    ports:
      - containerPort: 8000
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom
  - name: runway-exporter
    image: my-api
    command: ["python", "-m", "asgi_runway.exporter", "--port", "9091"]
    ports:
      - containerPort: 9091
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom
volumes:
  - name: prom-dir
    emptyDir: {}
```
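If your cluster's Prometheus uses the common annotation-based pod discovery, point it at the exporter's port. These annotations are a widespread convention, not something asgi-runway configures for you:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9091"
```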
Which option to use?
- Single container (EC2, plain Docker): use Option A (`metrics_port`).
- Multiple containers (Docker Compose, Kubernetes, ECS): use Option B (sidecar) with a shared volume, so that gunicorn workers across the pod are all aggregated by the exporter.
Multi-process mode (gunicorn + uvicorn workers)
Set the env var before starting the server — `prometheus_client` handles the rest:

```bash
export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc
mkdir -p $PROMETHEUS_MULTIPROC_DIR
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker
```

`runway_requests_in_flight` will be automatically summed across all workers (`multiprocess_mode="livesum"`).
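One piece of standard prometheus_client multiprocess hygiene applies here as well (this comes from prometheus_client's own documentation, not from asgi-runway): tell the client library when a worker dies, so its gauge files are cleaned up and `livesum` stays accurate:

```python
# gunicorn.conf.py
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Remove the dead worker's metric files from PROMETHEUS_MULTIPROC_DIR.
    multiprocess.mark_process_dead(worker.pid)
```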
How it works
`RunwayMiddleware` is a raw ASGI middleware (not `BaseHTTPMiddleware`, which has known streaming issues). It wraps every request:

```text
request arrives → REQUESTS_IN_FLIGHT.inc()
       ↓
  app processes
       ↓
response sent   → REQUESTS_IN_FLIGHT.dec()
                → REQUESTS_TOTAL.inc()
                → REQUEST_DURATION_SECONDS.observe()
```
The `try/finally` block ensures the gauge is decremented even if the handler raises an exception.
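A simplified sketch of that pattern (illustrative; the real middleware also records the counter and histogram, and skips excluded paths):

```python
from prometheus_client import Gauge

REQUESTS_IN_FLIGHT = Gauge(
    "runway_requests_in_flight", "Requests currently being processed"
)

class InFlightMiddleware:  # simplified stand-in for RunwayMiddleware
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket traffic straight through
            await self.app(scope, receive, send)
            return
        REQUESTS_IN_FLIGHT.inc()
        try:
            await self.app(scope, receive, send)
        finally:
            REQUESTS_IN_FLIGHT.dec()  # decrement even if the handler raised
```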
Note: `runway_requests_in_flight` only measures requests that have entered the middleware. Requests dropped by the OS TCP backlog or rejected by uvicorn's `--limit-concurrency` are invisible to it. See docs/request-limits.md for the full picture, including recommended production values for all three layers.