
asgi-runway

Requests-in-queue metrics for ASGI/FastAPI — the right signal for pod autoscaling.


Why not CPU / Memory / RPS?

| Signal | Problem |
| --- | --- |
| CPU / memory | Reactive. The pod is already overloaded before the metric crosses the threshold. |
| Requests per second (RPS) | Misleading. 10 req/s at 50 ms latency is 0.5 in-flight; 10 req/s at 5 s latency is 50 in-flight. Same RPS, wildly different load. |
| Requests in flight | Directly measures backlog. Based on Little's Law (L = λW): combines throughput and latency into one number. Scale up when this exceeds your pod's target concurrency. |

Installation

pip install asgi-runway

Quick start

from fastapi import FastAPI
from asgi_runway import RunwayMiddleware, metrics_router

app = FastAPI()
app.add_middleware(RunwayMiddleware)
app.include_router(metrics_router)  # exposes GET /metrics

Or use the one-liner:

from asgi_runway import setup
setup(app)

Metrics exposed

| Metric | Type | Description |
| --- | --- | --- |
| runway_requests_in_flight | Gauge | Primary autoscaling signal. Requests currently being processed. |
| runway_requests_in_flight_by_route | Gauge (labelled) | Per-route-group breakdown (opt-in). |
| runway_requests_total | Counter | Total requests by method + status. |
| runway_request_duration_seconds | Histogram | Latency by method. |

Per-route granularity

from asgi_runway import setup

setup(
    app,
    route_groups=[
        (r"^/api/infer", "inference"),   # heavy GPU work
        (r"^/api/embed", "embedding"),   # lighter work
    ],
)

This populates runway_requests_in_flight_by_route{route="inference"} so you can scale inference and embedding deployments independently.
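For per-group autoscaling, the same per-pod averaging pattern shown in the KEDA section below applies with a label selector. A hypothetical PromQL query (the job name is an assumption):

```promql
# Average in-flight inference requests per pod
sum(runway_requests_in_flight_by_route{route="inference"}) / count(up{job="my-api"})
```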

Excluding paths

Health checks and the metrics endpoint itself are excluded by default (/metrics, /healthz, /health). Override with:

app.add_middleware(RunwayMiddleware, exclude_paths=["/ping", "/metrics"])

Finding your autoscaling threshold

The threshold is the maximum number of in-flight requests a single pod can handle before latency degrades. Setting it too high means requests pile up before scaling kicks in; too low means you over-provision.

The formula (Little's Law)

threshold = target_RPS_per_pod × target_p95_latency_in_seconds

Example: you want each pod to serve 50 req/s with p95 < 200ms:

threshold = 50 × 0.2 = 10
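The arithmetic is trivial, but a helper keeps the units honest (RPS times seconds). This is an illustrative function, not part of the library's API:

```python
def autoscaling_threshold(target_rps: float, target_p95_s: float) -> float:
    """Little's Law: in-flight requests = throughput (req/s) x latency (s)."""
    return target_rps * target_p95_s

print(autoscaling_threshold(50, 0.2))  # → 10.0
```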

Finding it empirically with the sweep tool

The repo includes a sweep tool that steps through increasing concurrency levels, measures latency at each, and tells you where your server saturates:

pip install "asgi-runway[dev]"

# For async I/O workloads (DB queries, external API calls)
# Won't saturate — prints the Little's Law formula for your numbers
python -m examples.load_test --mode sweep

# For CPU-bound workloads (sync routes, heavy computation)
# Will saturate — prints the recommended threshold directly
python -m examples.load_test --mode sweep --workload cpu --duration-ms 200

Example output for a CPU-bound endpoint:

  conc      p50      p95      p99    throughput  status
  ──────────────────────────────────────────────────────────────
       1   0.152s   0.152s   0.152s     6.6 req/s  ✓ ok
       2   0.167s   0.215s   0.215s     9.3 req/s  ⚠ degrading
       4   0.271s   0.340s   0.340s    11.8 req/s  ✗ saturated
       8   0.358s   0.584s   0.584s    13.7 req/s  ⚠ degrading
      16   0.901s   1.184s   1.184s    13.5 req/s  ⚠ degrading

  Saturation point : ~4 concurrent requests
  Recommended autoscaling threshold : 3  (75% of saturation)

Saturation is detected when p95 crosses 2× its baseline. The recommended threshold is 75% of the saturation point — this gives pods time to scale up before they hit the wall (new pods take 30–60s to start).
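The detection rule above can be sketched in a few lines. This is an illustration of the described logic, not the sweep tool's actual source; the function name is hypothetical:

```python
def recommend_threshold(results, degrade_factor=2.0, safety=0.75):
    """results: list of (concurrency, p95_seconds) pairs, sorted by concurrency.

    Saturation = first concurrency whose p95 exceeds degrade_factor x the
    baseline (lowest-concurrency) p95. Recommend safety x saturation.
    """
    baseline_p95 = results[0][1]
    for conc, p95 in results:
        if p95 > degrade_factor * baseline_p95:
            return max(1, int(conc * safety))
    return None  # never saturated: fall back to the Little's Law formula

# The sweep output shown above, as (concurrency, p95) pairs:
sweep = [(1, 0.152), (2, 0.215), (4, 0.340), (8, 0.584), (16, 1.184)]
print(recommend_threshold(sweep))  # → 3  (saturates at 4; 75% of 4)
```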

By workload type

| Workload | How to find the threshold |
| --- | --- |
| Async I/O (DB, HTTP calls) | target_RPS × target_p95_seconds. Run the sweep to confirm no saturation. |
| CPU-bound (sync routes) | Run the sweep with --workload cpu. Roughly equals the thread pool size (min(32, cpu_count + 4)). |
| ML inference (GPU) | Usually equals your batch size × pipeline depth. Measure with the sweep against your real model endpoint. |
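The thread pool formula in the table is CPython's default ThreadPoolExecutor sizing (since Python 3.8), so you can compute the CPU-bound starting point directly:

```python
import os

# Default worker count of concurrent.futures.ThreadPoolExecutor (Python 3.8+),
# which bounds how many sync routes can run concurrently per process.
default_pool = min(32, (os.cpu_count() or 1) + 4)
print(default_pool)  # e.g. 12 on an 8-core machine
```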

The right KEDA query

Don't use raw sum(runway_requests_in_flight) — that scales based on total load, not per-pod load. Use average:

sum(runway_requests_in_flight) / count(up{job="your-app"})

This means: "scale when the average pod is handling more than N requests", which is what you actually want.

Kubernetes autoscaling

KEDA (recommended)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-api-scaledobject
spec:
  scaleTargetRef:
    name: my-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: runway_requests_in_flight
        # Scale when the average pod exceeds 10 in-flight requests.
        # Use the per-pod average, not raw sum — see "Finding your threshold" above.
        query: sum(runway_requests_in_flight) / count(up{job="my-api"})
        threshold: "10"

Kubernetes HPA with custom metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
    - type: External
      external:
        metric:
          name: runway_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"   # target 10 in-flight per pod

Decoupling the metrics server

When the application is overloaded, its event loop may be saturated, causing /metrics scrape requests to time out precisely when you need them most — right before the autoscaler would fire.

The solution is to serve metrics from a server that does not share the application's event loop. asgi-runway offers two options depending on your deployment.

Option A — Embedded metrics thread (plain Docker / EC2 / single container)

Pass metrics_port to setup(). A ThreadingHTTPServer starts in a background daemon thread — no uvicorn, no asyncio, fully independent:

from asgi_runway import setup

setup(app, metrics_port=9091, include_metrics_route=False)
  • Prometheus scrapes port 9091. App traffic goes to port 8000.
  • The metrics thread is isolated from the event loop, so it cannot be blocked by in-flight application requests.
  • Works for both single-process uvicorn and multiprocess (gunicorn + uvicorn workers); only the multiprocess case needs a shared PROMETHEUS_MULTIPROC_DIR.
┌─────────────────────────────────────────────────┐
│  Single container                               │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  uvicorn (port 8000) │  ← app traffic        │
│  │  asyncio event loop  │                       │
│  └──────────────────────┘                       │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  metrics thread      │  ← Prometheus scrapes │
│  │  (port 9091)         │    this port          │
│  │  ThreadingHTTPServer │                       │
│  └──────────────────────┘                       │
└─────────────────────────────────────────────────┘
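The pattern is plain stdlib threading, nothing asyncio-aware. A minimal standalone sketch (not asgi-runway's actual implementation; `render_metrics` is a hypothetical callable standing in for whatever produces the exposition text, e.g. prometheus_client's `generate_latest`):

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def start_metrics_thread(render_metrics, port: int = 9091) -> ThreadingHTTPServer:
    """Serve GET /metrics from a daemon thread, independent of any event loop."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics()  # exposition text as bytes
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the app's access log clean
            pass

    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the server runs its own OS threads, a scrape succeeds even while the app's event loop is pegged.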

Option B — Sidecar process (Docker Compose / Kubernetes / ECS)

Run the exporter as a separate container alongside the app. Both containers share PROMETHEUS_MULTIPROC_DIR as a mounted volume. The exporter reads the metric files and serves them on its own port — no code in the app server.

# Requires PROMETHEUS_MULTIPROC_DIR to be set and shared
python -m asgi_runway.exporter --port 9091
┌────────────────────────────────────────────────────────────┐
│  Pod / task                                                │
│                                                            │
│  ┌─────────────────────┐   ┌──────────────────────────┐   │
│  │  uvicorn (port 8000)│   │  runway-exporter (9091)  │   │
│  │  RunwayMiddleware   │   │  python -m               │   │
│  │  writes metric files│   │  asgi_runway.exporter    │   │
│  │  to shared volume ──┼───┼──► reads metric files    │   │
│  └─────────────────────┘   └──────────────────────────┘   │
│           │                           │                    │
│      app traffic                 Prometheus scrapes        │
└────────────────────────────────────────────────────────────┘

Docker Compose:

version: "3"
services:
  app:
    image: my-api
    ports:
      - "8000:8000"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: uvicorn app:app --host 0.0.0.0 --port 8000

  runway-exporter:
    image: my-api          # same image, different entrypoint
    ports:
      - "9091:9091"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: python -m asgi_runway.exporter --port 9091

volumes:
  prom_data:

Kubernetes sidecar container:

containers:
  - name: app
    image: my-api
    ports:
      - containerPort: 8000
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

  - name: runway-exporter
    image: my-api
    command: ["python", "-m", "asgi_runway.exporter", "--port", "9091"]
    ports:
      - containerPort: 9091
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

volumes:
  - name: prom-dir
    emptyDir: {}

Which option to use?

  • Single container (EC2, plain Docker): use Option A (metrics_port).
  • Multiple containers (Docker Compose, Kubernetes, ECS): use Option B (sidecar) with a shared volume, so the exporter aggregates metrics from every gunicorn worker in the pod.

Multi-process mode (gunicorn + uvicorn workers)

Set the env var before starting the server — prometheus_client handles the rest:

export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc
mkdir -p $PROMETHEUS_MULTIPROC_DIR
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker

runway_requests_in_flight will be automatically summed across all workers (multiprocess_mode="livesum").
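In prometheus_client terms, livesum is declared on the gauge itself. An illustrative declaration (the metric name matches the table above; the surrounding code is a sketch, not the library's source):

```python
from prometheus_client import Gauge

# "livesum": each worker writes its own value file under
# PROMETHEUS_MULTIPROC_DIR; on scrape, values from live processes
# are summed into a single series.
REQUESTS_IN_FLIGHT = Gauge(
    "runway_requests_in_flight",
    "Requests currently being processed",
    multiprocess_mode="livesum",
)
```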

How it works

RunwayMiddleware is a raw ASGI middleware (not BaseHTTPMiddleware, which has known streaming issues). It wraps every request:

request arrives → REQUESTS_IN_FLIGHT.inc()
       ↓
  app processes
       ↓
response sent → REQUESTS_IN_FLIGHT.dec()
             → REQUESTS_TOTAL.inc()
             → REQUEST_DURATION_SECONDS.observe()

The try/finally block ensures the gauge is decremented even if the handler raises an exception.
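A stripped-down sketch of that wrapping pattern (illustrative, not the library's actual source; the gauge and histogram are passed in rather than hard-coded):

```python
import time


class InFlightMiddleware:
    """Raw ASGI middleware sketch: track in-flight requests with try/finally."""

    def __init__(self, app, gauge, duration_histogram=None):
        self.app = app
        self.gauge = gauge
        self.duration = duration_histogram

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket straight through
            await self.app(scope, receive, send)
            return
        self.gauge.inc()
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            self.gauge.dec()  # runs even if the handler raises
            if self.duration is not None:
                self.duration.observe(time.perf_counter() - start)
```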

Note: runway_requests_in_flight only measures requests that have entered the middleware. Requests dropped by the OS TCP backlog or rejected by uvicorn's --limit-concurrency are invisible to it. See docs/request-limits.md for the full picture, including recommended production values for all three layers.
