asgi-runway

Requests-in-queue metrics for ASGI/FastAPI — the right signal for pod autoscaling.


Why not CPU / Memory / RPS?

| Signal | Problem |
| --- | --- |
| CPU / Memory | Reactive. The pod is already overloaded before the metric crosses the threshold. |
| Requests per second (RPS) | Misleading. 10 req/s at 50 ms latency is 0.5 in-flight; 10 req/s at 5 s latency is 50 in-flight. Same RPS, wildly different load. |
| Requests in flight | Directly measures backlog. By Little's Law (L = λW) it combines throughput and latency into one number. Scale up when it exceeds your pod's target concurrency. |
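The in-flight numbers in the RPS row follow directly from Little's Law; a quick sanity check:

```python
def in_flight(rps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W (average requests in flight)."""
    return rps * latency_s

print(in_flight(10, 0.050))  # 0.5 — 10 req/s at 50 ms latency
print(in_flight(10, 5.0))    # 50.0 — 10 req/s at 5 s latency
```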

Installation

pip install asgi-runway

Quick start

from fastapi import FastAPI
from asgi_runway import RunwayMiddleware, metrics_router

app = FastAPI()
app.add_middleware(RunwayMiddleware)
app.include_router(metrics_router)  # exposes GET /metrics

Or use the one-liner:

from asgi_runway import setup
setup(app)

Metrics exposed

| Metric | Type | Description |
| --- | --- | --- |
| runway_requests_in_flight | Gauge | Primary autoscaling signal: requests currently being processed. |
| runway_requests_in_flight_by_route | Gauge (labelled) | Per-route-group breakdown (opt-in). |
| runway_requests_total | Counter | Total requests by method and status. |
| runway_request_duration_seconds | Histogram | Latency by method. |

Per-route granularity

from asgi_runway import setup

setup(
    app,
    route_groups=[
        (r"^/api/infer", "inference"),   # heavy GPU work
        (r"^/api/embed", "embedding"),   # lighter work
    ],
)

This populates runway_requests_in_flight_by_route{route="inference"} so you can scale inference and embedding deployments independently.
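The grouping logic can be pictured as first-match-wins over the pattern list. This is a hypothetical re-implementation for illustration, not asgi-runway's actual source; the `route_label` helper and the `"other"` fallback label are assumptions:

```python
import re

# Same route_groups shape as in the setup() call above.
route_groups = [
    (r"^/api/infer", "inference"),
    (r"^/api/embed", "embedding"),
]

def route_label(path: str, default: str = "other") -> str:
    """Return the label of the first pattern that matches the path."""
    for pattern, label in route_groups:
        if re.match(pattern, path):
            return label
    return default

print(route_label("/api/infer/v2"))  # inference
print(route_label("/healthz"))       # other
```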

Excluding paths

Health checks and the metrics endpoint itself are excluded by default (/metrics, /healthz, /health). Override with:

app.add_middleware(RunwayMiddleware, exclude_paths=["/ping", "/metrics"])

Finding your autoscaling threshold

The threshold is the maximum number of in-flight requests a single pod can handle before latency degrades. Setting it too high means requests pile up before scaling kicks in; too low means you over-provision.

The formula (Little's Law)

threshold = target_RPS_per_pod × target_p95_latency_in_seconds

Example: each pod should serve 50 req/s with p95 latency under 200 ms:

threshold = 50 × 0.2 = 10

Finding it empirically with the sweep tool

The repo includes a sweep tool that steps through increasing concurrency levels, measures latency at each, and tells you where your server saturates:

pip install "asgi-runway[dev]"

# For async I/O workloads (DB queries, external API calls)
# Won't saturate — prints the Little's Law formula for your numbers
python -m examples.load_test --mode sweep

# For CPU-bound workloads (sync routes, heavy computation)
# Will saturate — prints the recommended threshold directly
python -m examples.load_test --mode sweep --workload cpu --duration-ms 200

Example output for a CPU-bound endpoint:

  conc      p50      p95      p99    throughput  status
  ──────────────────────────────────────────────────────────────
       1   0.152s   0.152s   0.152s     6.6 req/s  ✓ ok
       2   0.167s   0.215s   0.215s     9.3 req/s  ⚠ degrading
       4   0.271s   0.340s   0.340s    11.8 req/s  ✗ saturated
       8   0.358s   0.584s   0.584s    13.7 req/s  ⚠ degrading
      16   0.901s   1.184s   1.184s    13.5 req/s  ⚠ degrading

  Saturation point : ~4 concurrent requests
  Recommended autoscaling threshold : 3  (75% of saturation)

Saturation is detected when p95 crosses 2× its baseline. The recommended threshold is 75% of the saturation point — this gives pods time to scale up before they hit the wall (new pods take 30–60s to start).
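The detection rule described above can be sketched in a few lines. This is a simplified model of the sweep tool's logic, not its actual source; the function name and the `(concurrency, p95)` input shape are assumptions:

```python
def find_threshold(results, factor=2.0, margin=0.75):
    """results: list of (concurrency, p95_seconds), concurrency increasing.

    Saturation = first level where p95 exceeds factor * baseline p95.
    Returns margin * saturation point, or None if never saturated
    (in that case, fall back to the Little's Law formula).
    """
    baseline_p95 = results[0][1]
    for concurrency, p95 in results:
        if p95 > factor * baseline_p95:
            return max(1, int(concurrency * margin))
    return None

# The sweep output above: baseline p95 = 0.152s, 2x = 0.304s,
# first crossed at concurrency 4 -> recommended threshold 3.
sweep = [(1, 0.152), (2, 0.215), (4, 0.340), (8, 0.584)]
print(find_threshold(sweep))  # 3
```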

By workload type

| Workload | How to find the threshold |
| --- | --- |
| Async I/O (DB, HTTP calls) | target_RPS × target_p95_seconds. Run the sweep to confirm there is no saturation. |
| CPU-bound (sync routes) | Run the sweep with --workload cpu. Roughly equals the thread pool size (min(32, cpu_count + 4)). |
| ML inference (GPU) | Usually batch size × pipeline depth. Measure with the sweep against your real model endpoint. |

The right KEDA query

Don't use raw sum(runway_requests_in_flight) — that scales based on total load, not per-pod load. Use average:

sum(runway_requests_in_flight) / count(up{job="your-app"})

This means: "scale when the average pod is handling more than N requests", which is what you actually want.

Kubernetes autoscaling

KEDA (recommended)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-api-scaledobject
spec:
  scaleTargetRef:
    name: my-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: runway_requests_in_flight
        # Scale when the average pod exceeds 10 in-flight requests.
        # Use the per-pod average, not raw sum — see "Finding your threshold" above.
        query: sum(runway_requests_in_flight) / count(up{job="my-api"})
        threshold: "10"

Kubernetes HPA with custom metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
    - type: External
      external:
        metric:
          name: runway_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"   # target 10 in-flight per pod

Multi-process mode (gunicorn + uvicorn workers)

Set the env var before starting the server — prometheus_client handles the rest:

export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc
mkdir -p $PROMETHEUS_MULTIPROC_DIR
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker

runway_requests_in_flight will be automatically summed across all workers (multiprocess_mode="livesum").
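For reference, this is how such a gauge is declared with prometheus_client. A sketch of the pattern, not asgi-runway's actual definition; in multiprocess mode, "livesum" sums the gauge across live worker processes, and in single-process mode it behaves like a normal gauge:

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
REQUESTS_IN_FLIGHT = Gauge(
    "runway_requests_in_flight",
    "Requests currently being processed",
    registry=registry,
    multiprocess_mode="livesum",  # sum across live workers
)

REQUESTS_IN_FLIGHT.inc()
print(registry.get_sample_value("runway_requests_in_flight"))  # 1.0
```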

How it works

RunwayMiddleware is a raw ASGI middleware (not BaseHTTPMiddleware, which has known streaming issues). It wraps every request:

request arrives → REQUESTS_IN_FLIGHT.inc()
       ↓
  app processes
       ↓
response sent → REQUESTS_IN_FLIGHT.dec()
             → REQUESTS_TOTAL.inc()
             → REQUEST_DURATION_SECONDS.observe()

The try/finally block ensures the gauge is decremented even if the handler raises an exception.
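The pattern above can be sketched as a minimal raw-ASGI middleware. This is an illustrative re-implementation under stated assumptions (a gauge with inc/dec and an optional histogram), not asgi-runway's actual source:

```python
import time

class InFlightMiddleware:
    """Minimal sketch of the inc -> app -> dec pattern described above."""

    def __init__(self, app, gauge, histogram=None):
        self.app = app
        self.gauge = gauge
        self.histogram = histogram

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through lifespan/websocket scopes untouched.
            await self.app(scope, receive, send)
            return
        self.gauge.inc()
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Runs even if the handler raises, so the gauge never leaks.
            self.gauge.dec()
            if self.histogram is not None:
                self.histogram.observe(time.perf_counter() - start)
```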

Note: runway_requests_in_flight only measures requests that have entered the middleware. Requests dropped by the OS TCP backlog or rejected by uvicorn's --limit-concurrency are invisible to it. See docs/request-limits.md for the full picture, including recommended production values for all three layers.
