
asgi-runway

Requests-in-queue metrics for ASGI/FastAPI — the right signal for pod autoscaling.


Why not CPU / Memory / RPS?

Signal                      Problem
──────────────────────────────────────────────────────────────────────────────
CPU / Memory                Reactive. The pod is already overloaded before the metric crosses the threshold.
Requests Per Second (RPS)   Misleading. 10 req/s @ 50 ms latency = 0.5 in-flight; 10 req/s @ 5 s latency = 50 in-flight. Same RPS, wildly different load.
Requests In Flight          Directly measures backlog. Based on Little's Law (L = λW), it combines throughput and latency into one number. Scale up when this exceeds your pod's target concurrency.
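
The in-flight numbers above are just Little's Law applied (illustrative values only):

# Little's Law: in-flight requests L = arrival rate (lambda) * time in system (W)
for rps, latency_s in [(10, 0.05), (10, 5.0)]:
    print(f"{rps} req/s at {latency_s} s latency -> {rps * latency_s} in flight")
# 10 req/s at 0.05 s latency -> 0.5 in flight
# 10 req/s at 5.0 s latency -> 50.0 in flight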

Installation

pip install asgi-runway

Quick start

from fastapi import FastAPI
from asgi_runway import RunwayMiddleware, metrics_router

app = FastAPI()
app.add_middleware(RunwayMiddleware)
app.include_router(metrics_router)  # exposes GET /metrics

Or use the one-liner:

from asgi_runway import setup
setup(app)
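
Either way, a scrape of GET /metrics returns standard Prometheus text exposition along these lines (values illustrative):

# HELP runway_requests_in_flight Requests currently being processed
# TYPE runway_requests_in_flight gauge
runway_requests_in_flight 3.0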

Metrics exposed

Metric                               Type               Description
──────────────────────────────────────────────────────────────────────────────
runway_requests_in_flight            Gauge              Primary autoscaling signal. Requests currently being processed.
runway_requests_in_flight_by_route   Gauge (labelled)   Per-route-group breakdown (opt-in).
runway_requests_total                Counter            Total requests by method + status.
runway_request_duration_seconds      Histogram          Latency by method.

Per-route granularity

from asgi_runway import setup

setup(
    app,
    route_groups=[
        (r"^/api/infer", "inference"),   # heavy GPU work
        (r"^/api/embed", "embedding"),   # lighter work
    ],
)

This populates runway_requests_in_flight_by_route{route="inference"} so you can scale inference and embedding deployments independently.

Excluding paths

Health checks and the metrics endpoint itself are excluded by default (/metrics, /healthz, /health). Override with:

app.add_middleware(RunwayMiddleware, exclude_paths=["/ping", "/metrics"])

Finding your autoscaling threshold

The threshold is the maximum number of in-flight requests a single pod can handle before latency degrades. Setting it too high means requests pile up before scaling kicks in; too low means you over-provision.

The formula (Little's Law)

threshold = target_RPS_per_pod × target_p95_latency_in_seconds

Example: you want each pod to serve 50 req/s with p95 < 200ms:

threshold = 50 × 0.2 = 10
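
The same calculation as a tiny helper (autoscaling_threshold is a hypothetical name, not part of the library):

def autoscaling_threshold(target_rps_per_pod: float, target_p95_s: float) -> float:
    # Little's Law: in-flight requests = throughput * latency.
    return target_rps_per_pod * target_p95_s

print(autoscaling_threshold(50, 0.2))  # 10.0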

Finding it empirically with the sweep tool

The repo includes a sweep tool that steps through increasing concurrency levels, measures latency at each, and tells you where your server saturates:

pip install "asgi-runway[dev]"

# For async I/O workloads (DB queries, external API calls)
# Won't saturate — prints the Little's Law formula for your numbers
python -m examples.load_test --mode sweep

# For CPU-bound workloads (sync routes, heavy computation)
# Will saturate — prints the recommended threshold directly
python -m examples.load_test --mode sweep --workload cpu --duration-ms 200

Example output for a CPU-bound endpoint:

  conc      p50      p95      p99    throughput  status
  ──────────────────────────────────────────────────────────────
       1   0.152s   0.152s   0.152s     6.6 req/s  ✓ ok
       2   0.167s   0.215s   0.215s     9.3 req/s  ⚠ degrading
       4   0.271s   0.340s   0.340s    11.8 req/s  ✗ saturated
       8   0.358s   0.584s   0.584s    13.7 req/s  ⚠ degrading
      16   0.901s   1.184s   1.184s    13.5 req/s  ⚠ degrading

  Saturation point : ~4 concurrent requests
  Recommended autoscaling threshold : 3  (75% of saturation)

Saturation is detected when p95 crosses 2× its baseline. The recommended threshold is 75% of the saturation point — this gives pods time to scale up before they hit the wall (new pods take 30–60s to start).
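
The rule is simple enough to sketch; this is a hypothetical re-implementation of the logic described above, not the sweep tool's actual code:

def recommend_threshold(results: list[tuple[int, float]]) -> int | None:
    """results: (concurrency, p95_seconds) pairs, lowest concurrency first."""
    baseline_p95 = results[0][1]
    for concurrency, p95 in results:
        if p95 > 2 * baseline_p95:                   # p95 crossed 2x baseline
            return max(1, int(concurrency * 0.75))   # keep 25% headroom
    return None  # never saturated: fall back to the Little's Law formula

print(recommend_threshold([(1, 0.152), (2, 0.215), (4, 0.340)]))  # 3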

By workload type

Workload                     How to find the threshold
──────────────────────────────────────────────────────────────────────────────
Async I/O (DB, HTTP calls)   target_RPS × target_p95_in_seconds. Run the sweep to confirm there is no saturation.
CPU-bound (sync routes)      Run the sweep with --workload cpu. Roughly equals the thread pool size (min(32, cpu_count + 4)).
ML inference (GPU)           Usually your batch size × pipeline depth. Measure with the sweep against your real model endpoint.

The right KEDA query

Don't use raw sum(runway_requests_in_flight) — that scales based on total load, not per-pod load. Use average:

sum(runway_requests_in_flight) / count(up{job="your-app"})

This means: "scale when the average pod is handling more than N requests", which is what you actually want.
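
Worked with hypothetical numbers and a threshold of 10:

threshold = 10
for total_in_flight, pods in [(30, 3), (60, 3)]:
    avg = total_in_flight / pods
    print(f"{total_in_flight} in flight / {pods} pods = {avg:g} per pod:",
          "scale up" if avg > threshold else "ok")
# 30 in flight / 3 pods = 10 per pod: ok
# 60 in flight / 3 pods = 20 per pod: scale up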

Kubernetes autoscaling

KEDA (recommended)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-api-scaledobject
spec:
  scaleTargetRef:
    name: my-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: runway_requests_in_flight
        # Scale when the average pod exceeds 10 in-flight requests.
        # Use the per-pod average, not raw sum — see "Finding your threshold" above.
        query: sum(runway_requests_in_flight) / count(up{job="my-api"})
        threshold: "10"

Kubernetes HPA with custom metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
    - type: External
      external:
        metric:
          name: runway_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"   # target 10 in-flight per pod

Decoupling the metrics server

When the application is overloaded, its event loop may be saturated, causing /metrics scrape requests to time out precisely when you need them most — right before the autoscaler would fire.

The solution is to serve metrics from a server that does not share the application's event loop. asgi-runway offers two options depending on your deployment.

Option A — Embedded metrics thread (plain Docker / EC2 / single container)

Pass metrics_port to setup(). A ThreadingHTTPServer starts in a background daemon thread — no uvicorn, no asyncio, fully independent:

from asgi_runway import setup

setup(app, metrics_port=9091, include_metrics_route=False)

  • Prometheus scrapes port 9091. App traffic goes to port 8000.
  • The metrics thread is isolated from the event loop, so it cannot be blocked by in-flight application requests.
  • Works for both single-process uvicorn and multiprocess (gunicorn + uvicorn workers); the single-process case needs no shared directory.
┌─────────────────────────────────────────────────┐
│  Single container                               │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  uvicorn (port 8000) │  ← app traffic        │
│  │  asyncio event loop  │                       │
│  └──────────────────────┘                       │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  metrics thread      │  ← Prometheus scrapes │
│  │  (port 9091)         │    this port          │
│  │  ThreadingHTTPServer │                       │
│  └──────────────────────┘                       │
└─────────────────────────────────────────────────┘
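
For comparison, plain prometheus_client offers the same isolation via start_http_server, which also serves from a daemon thread. A sketch of the pattern, not asgi-runway's implementation:

from prometheus_client import start_http_server

# Serves the default registry on port 9091 from a background daemon thread,
# independent of the asyncio event loop handling application traffic.
start_http_server(9091)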

Option B — Sidecar process (Docker Compose / Kubernetes / ECS)

Run the exporter as a separate container alongside the app. Both containers share PROMETHEUS_MULTIPROC_DIR as a mounted volume. The exporter reads the metric files and serves them on its own port — no code in the app server.

# Requires PROMETHEUS_MULTIPROC_DIR to be set and shared
python -m asgi_runway.exporter --port 9091
┌────────────────────────────────────────────────────────────┐
│  Pod / task                                                │
│                                                            │
│  ┌─────────────────────┐   ┌──────────────────────────┐   │
│  │  uvicorn (port 8000)│   │  runway-exporter (9091)  │   │
│  │  RunwayMiddleware   │   │  python -m               │   │
│  │  writes metric files│   │  asgi_runway.exporter    │   │
│  │  to shared volume ──┼───┼──► reads metric files    │   │
│  └─────────────────────┘   └──────────────────────────┘   │
│           │                           │                    │
│      app traffic                 Prometheus scrapes        │
└────────────────────────────────────────────────────────────┘
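
Under the hood this is the standard prometheus_client multiprocess pattern; a rough equivalent of what the exporter does, not necessarily its exact code:

import time

from prometheus_client import CollectorRegistry, multiprocess, start_http_server

# Aggregate the metric files written by all workers into one registry,
# then serve it from this standalone process.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)  # reads PROMETHEUS_MULTIPROC_DIR
start_http_server(9091, registry=registry)

while True:        # keep the exporter process alive
    time.sleep(60)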

Docker Compose:

version: "3"
services:
  app:
    image: my-api
    ports:
      - "8000:8000"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: uvicorn app:app --host 0.0.0.0 --port 8000

  runway-exporter:
    image: my-api          # same image, different entrypoint
    ports:
      - "9091:9091"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: python -m asgi_runway.exporter --port 9091

volumes:
  prom_data:

Kubernetes sidecar container:

containers:
  - name: app
    image: my-api
    ports:
      - containerPort: 8000
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

  - name: runway-exporter
    image: my-api
    command: ["python", "-m", "asgi_runway.exporter", "--port", "9091"]
    ports:
      - containerPort: 9091
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

volumes:
  - name: prom-dir
    emptyDir: {}

Which option to use?

  • Single container (EC2, plain Docker): use Option A (metrics_port).
  • Multiple containers (Docker Compose, Kubernetes, ECS): use Option B (sidecar) with a shared volume, so that gunicorn workers across the pod are all aggregated by the exporter.

Multi-process mode (gunicorn + uvicorn workers)

Set the env var before starting the server — prometheus_client handles the rest:

export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc
mkdir -p $PROMETHEUS_MULTIPROC_DIR
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker

runway_requests_in_flight will be automatically summed across all workers (multiprocess_mode="livesum").
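
For reference, "livesum" is a prometheus_client gauge mode that sums the gauge across live worker processes at collection time. A sketch of such a declaration (asgi-runway's internal declaration may differ):

from prometheus_client import Gauge

REQUESTS_IN_FLIGHT = Gauge(
    "runway_requests_in_flight",
    "Requests currently being processed",
    multiprocess_mode="livesum",  # sum across live workers, ignore dead ones
)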

How it works

RunwayMiddleware is a raw ASGI middleware (not BaseHTTPMiddleware, which has known streaming issues). It wraps every request:

request arrives → REQUESTS_IN_FLIGHT.inc()
       ↓
  app processes
       ↓
response sent → REQUESTS_IN_FLIGHT.dec()
             → REQUESTS_TOTAL.inc()
             → REQUEST_DURATION_SECONDS.observe()

The try/finally block ensures the gauge is decremented even if the handler raises an exception.
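
A minimal sketch of that pattern; the real middleware also handles labels, route groups, and path exclusions:

import time

from prometheus_client import Gauge, Histogram

REQUESTS_IN_FLIGHT = Gauge("runway_requests_in_flight", "In-flight requests")
REQUEST_DURATION_SECONDS = Histogram("runway_request_duration_seconds", "Latency")

class InFlightMiddleware:
    """Stripped-down illustration of the raw ASGI pattern described above."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket straight through
            await self.app(scope, receive, send)
            return
        REQUESTS_IN_FLIGHT.inc()
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Runs even if the handler raises, so the gauge never leaks.
            REQUESTS_IN_FLIGHT.dec()
            REQUEST_DURATION_SECONDS.observe(time.perf_counter() - start)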

Note: runway_requests_in_flight only measures requests that have entered the middleware. Requests dropped by the OS TCP backlog or rejected by uvicorn's --limit-concurrency are invisible to it. See docs/request-limits.md for the full picture, including recommended production values for all three layers.
