
asgi-runway

Requests-in-queue metrics for ASGI/FastAPI — the right signal for pod autoscaling.


Why not CPU / Memory / RPS?

| Signal | Problem |
| --- | --- |
| CPU / memory | Reactive. The pod is already overloaded before the metric crosses the threshold. |
| Requests per second (RPS) | Misleading. 10 req/s at 50 ms latency is 0.5 in-flight; 10 req/s at 5 s latency is 50 in-flight. Same RPS, wildly different load. |
| Requests in flight | Directly measures backlog. Based on Little's Law (L = λW): combines throughput and latency into one number. Scale up when this exceeds your pod's target concurrency. |

Installation

pip install asgi-runway

Quick start

from fastapi import FastAPI
from asgi_runway import RunwayMiddleware, metrics_router

app = FastAPI()
app.add_middleware(RunwayMiddleware)
app.include_router(metrics_router)  # exposes GET /metrics

Or use the one-liner:

from asgi_runway import setup
setup(app)

Metrics exposed

| Metric | Type | Description |
| --- | --- | --- |
| runway_requests_in_flight | Gauge | Primary autoscaling signal. Requests currently being processed. |
| runway_requests_in_flight_by_route | Gauge (labelled) | Per-route-group breakdown (opt-in). |
| runway_requests_total | Counter | Total requests by method + status. |
| runway_request_duration_seconds | Histogram | Latency by method. |

Per-route granularity

from asgi_runway import setup

setup(
    app,
    route_groups=[
        (r"^/api/infer", "inference"),   # heavy GPU work
        (r"^/api/embed", "embedding"),   # lighter work
    ],
)

This populates runway_requests_in_flight_by_route{route="inference"} so you can scale inference and embedding deployments independently.
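For per-group autoscaling, the same per-pod averaging pattern shown in the KEDA section below applies with a label selector. A hypothetical PromQL query (the job name is an assumption):

```promql
# Average in-flight inference requests per pod
sum(runway_requests_in_flight_by_route{route="inference"}) / count(up{job="my-api"})
```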

Excluding paths

Health checks and the metrics endpoint itself are excluded by default (/metrics, /healthz, /health). Override with:

app.add_middleware(RunwayMiddleware, exclude_paths=["/ping", "/metrics"])

Finding your autoscaling threshold

The threshold is the maximum number of in-flight requests a single pod can handle before latency degrades. Setting it too high means requests pile up before scaling kicks in; too low means you over-provision.

The formula (Little's Law)

threshold = target_RPS_per_pod × target_p95_latency_in_seconds

Example: you want each pod to serve 50 req/s with p95 < 200ms:

threshold = 50 × 0.2 = 10
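The arithmetic is trivial, but a helper keeps the units honest (RPS times seconds). This is an illustrative function, not part of the library's API:

```python
def autoscaling_threshold(target_rps: float, target_p95_s: float) -> float:
    """Little's Law: in-flight requests = throughput (req/s) x latency (s)."""
    return target_rps * target_p95_s

print(autoscaling_threshold(50, 0.2))  # → 10.0
```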

Finding it empirically with the sweep tool

The repo includes a sweep tool that steps through increasing concurrency levels, measures latency at each, and tells you where your server saturates:

pip install "asgi-runway[dev]"

# For async I/O workloads (DB queries, external API calls)
# Won't saturate — prints the Little's Law formula for your numbers
python -m examples.load_test --mode sweep

# For CPU-bound workloads (sync routes, heavy computation)
# Will saturate — prints the recommended threshold directly
python -m examples.load_test --mode sweep --workload cpu --duration-ms 200

Example output for a CPU-bound endpoint:

  conc      p50      p95      p99    throughput  status
  ──────────────────────────────────────────────────────────────
       1   0.152s   0.152s   0.152s     6.6 req/s  ✓ ok
       2   0.167s   0.215s   0.215s     9.3 req/s  ⚠ degrading
       4   0.271s   0.340s   0.340s    11.8 req/s  ✗ saturated
       8   0.358s   0.584s   0.584s    13.7 req/s  ⚠ degrading
      16   0.901s   1.184s   1.184s    13.5 req/s  ⚠ degrading

  Saturation point : ~4 concurrent requests
  Recommended autoscaling threshold : 3  (75% of saturation)

Saturation is detected when p95 crosses 2× its baseline. The recommended threshold is 75% of the saturation point — this gives pods time to scale up before they hit the wall (new pods take 30–60s to start).
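The detection rule above can be sketched in a few lines. This is an illustration of the described logic, not the sweep tool's actual source; the function name is hypothetical:

```python
def recommend_threshold(results, degrade_factor=2.0, safety=0.75):
    """results: list of (concurrency, p95_seconds) pairs, sorted by concurrency.

    Saturation = first concurrency whose p95 exceeds degrade_factor x the
    baseline (lowest-concurrency) p95. Recommend safety x saturation.
    """
    baseline_p95 = results[0][1]
    for conc, p95 in results:
        if p95 > degrade_factor * baseline_p95:
            return max(1, int(conc * safety))
    return None  # never saturated: fall back to the Little's Law formula

# The sweep output shown above, as (concurrency, p95) pairs:
sweep = [(1, 0.152), (2, 0.215), (4, 0.340), (8, 0.584), (16, 1.184)]
print(recommend_threshold(sweep))  # → 3  (saturates at 4; 75% of 4)
```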

By workload type

| Workload | How to find the threshold |
| --- | --- |
| Async I/O (DB, HTTP calls) | target_RPS × target_p95_seconds. Run the sweep to confirm no saturation. |
| CPU-bound (sync routes) | Run the sweep with --workload cpu. Roughly equals the thread pool size (min(32, cpu_count + 4)). |
| ML inference (GPU) | Usually equals your batch size × pipeline depth. Measure with the sweep against your real model endpoint. |
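The thread pool formula in the table is CPython's default ThreadPoolExecutor sizing (since Python 3.8), so you can compute the CPU-bound starting point directly:

```python
import os

# Default worker count of concurrent.futures.ThreadPoolExecutor (Python 3.8+),
# which bounds how many sync routes can run concurrently per process.
default_pool = min(32, (os.cpu_count() or 1) + 4)
print(default_pool)  # e.g. 12 on an 8-core machine
```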

The right KEDA query

Don't use raw sum(runway_requests_in_flight) — that scales based on total load, not per-pod load. Use average:

sum(runway_requests_in_flight) / count(up{job="your-app"})

This means: "scale when the average pod is handling more than N requests", which is what you actually want.

Kubernetes autoscaling

KEDA (recommended)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-api-scaledobject
spec:
  scaleTargetRef:
    name: my-api
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: runway_requests_in_flight
        # Scale when the average pod exceeds 10 in-flight requests.
        # Use the per-pod average, not raw sum — see "Finding your threshold" above.
        query: sum(runway_requests_in_flight) / count(up{job="my-api"})
        threshold: "10"

Kubernetes HPA with custom metrics

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
    - type: External
      external:
        metric:
          name: runway_requests_in_flight
        target:
          type: AverageValue
          averageValue: "10"   # target 10 in-flight per pod

Decoupling the metrics server

When the application is overloaded, its event loop may be saturated, causing /metrics scrape requests to time out precisely when you need them most — right before the autoscaler would fire.

The solution is to serve metrics from a server that does not share the application's event loop. asgi-runway offers two options depending on your deployment.

Option A — Embedded metrics thread (plain Docker / EC2 / single container)

Pass metrics_port to setup(). A ThreadingHTTPServer starts in a background daemon thread — no uvicorn, no asyncio, fully independent:

from asgi_runway import setup

setup(app, metrics_port=9091, include_metrics_route=False)
  • Prometheus scrapes port 9091. App traffic goes to port 8000.
  • The metrics thread is isolated from the event loop, so it cannot be blocked by in-flight application requests.
  • Works for both single-process uvicorn and multiprocess (gunicorn + uvicorn workers); only the multiprocess case needs a shared PROMETHEUS_MULTIPROC_DIR.
┌─────────────────────────────────────────────────┐
│  Single container                               │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  uvicorn (port 8000) │  ← app traffic        │
│  │  asyncio event loop  │                       │
│  └──────────────────────┘                       │
│                                                 │
│  ┌──────────────────────┐                       │
│  │  metrics thread      │  ← Prometheus scrapes │
│  │  (port 9091)         │    this port          │
│  │  ThreadingHTTPServer │                       │
│  └──────────────────────┘                       │
└─────────────────────────────────────────────────┘
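The pattern is plain stdlib threading, nothing asyncio-aware. A minimal standalone sketch (not asgi-runway's actual implementation; `render_metrics` is a hypothetical callable standing in for whatever produces the exposition text, e.g. prometheus_client's `generate_latest`):

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def start_metrics_thread(render_metrics, port: int = 9091) -> ThreadingHTTPServer:
    """Serve GET /metrics from a daemon thread, independent of any event loop."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics()  # exposition text as bytes
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the app's access log clean
            pass

    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the server runs its own OS threads, a scrape succeeds even while the app's event loop is pegged.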

Option B — Sidecar process (Docker Compose / Kubernetes / ECS)

Run the exporter as a separate container alongside the app. Both containers share PROMETHEUS_MULTIPROC_DIR as a mounted volume. The exporter reads the metric files and serves them on its own port — no code in the app server.

# Requires PROMETHEUS_MULTIPROC_DIR to be set and shared
python -m asgi_runway.exporter --port 9091
┌────────────────────────────────────────────────────────────┐
│  Pod / task                                                │
│                                                            │
│  ┌─────────────────────┐   ┌──────────────────────────┐   │
│  │  uvicorn (port 8000)│   │  runway-exporter (9091)  │   │
│  │  RunwayMiddleware   │   │  python -m               │   │
│  │  writes metric files│   │  asgi_runway.exporter    │   │
│  │  to shared volume ──┼───┼──► reads metric files    │   │
│  └─────────────────────┘   └──────────────────────────┘   │
│           │                           │                    │
│      app traffic                 Prometheus scrapes        │
└────────────────────────────────────────────────────────────┘

Docker Compose:

version: "3"
services:
  app:
    image: my-api
    ports:
      - "8000:8000"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: uvicorn app:app --host 0.0.0.0 --port 8000

  runway-exporter:
    image: my-api          # same image, different entrypoint
    ports:
      - "9091:9091"
    environment:
      PROMETHEUS_MULTIPROC_DIR: /tmp/prom
    volumes:
      - prom_data:/tmp/prom
    command: python -m asgi_runway.exporter --port 9091

volumes:
  prom_data:

Kubernetes sidecar container:

containers:
  - name: app
    image: my-api
    ports:
      - containerPort: 8000
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

  - name: runway-exporter
    image: my-api
    command: ["python", "-m", "asgi_runway.exporter", "--port", "9091"]
    ports:
      - containerPort: 9091
    env:
      - name: PROMETHEUS_MULTIPROC_DIR
        value: /tmp/prom
    volumeMounts:
      - name: prom-dir
        mountPath: /tmp/prom

volumes:
  - name: prom-dir
    emptyDir: {}

Which option to use?

  • Single container (EC2, plain Docker): use Option A (metrics_port).
  • Multiple containers (Docker Compose, Kubernetes, ECS): use Option B (sidecar) with a shared volume, so the exporter aggregates metrics from every gunicorn worker in the pod.

Multi-process mode (gunicorn + uvicorn workers)

Set the env var before starting the server — prometheus_client handles the rest:

export PROMETHEUS_MULTIPROC_DIR=/tmp/prometheus_multiproc
mkdir -p $PROMETHEUS_MULTIPROC_DIR
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker

runway_requests_in_flight will be automatically summed across all workers (multiprocess_mode="livesum").
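In prometheus_client terms, livesum is declared on the gauge itself. An illustrative declaration (the metric name matches the table above; the surrounding code is a sketch, not the library's source):

```python
from prometheus_client import Gauge

# "livesum": each worker writes its own value file under
# PROMETHEUS_MULTIPROC_DIR; on scrape, values from live processes
# are summed into a single series.
REQUESTS_IN_FLIGHT = Gauge(
    "runway_requests_in_flight",
    "Requests currently being processed",
    multiprocess_mode="livesum",
)
```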

How it works

RunwayMiddleware is a raw ASGI middleware (not BaseHTTPMiddleware, which has known streaming issues). It wraps every request:

request arrives → REQUESTS_IN_FLIGHT.inc()
       ↓
  app processes
       ↓
response sent → REQUESTS_IN_FLIGHT.dec()
             → REQUESTS_TOTAL.inc()
             → REQUEST_DURATION_SECONDS.observe()

The try/finally block ensures the gauge is decremented even if the handler raises an exception.
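A stripped-down sketch of that wrapping pattern (illustrative, not the library's actual source; the gauge and histogram are passed in rather than hard-coded):

```python
import time


class InFlightMiddleware:
    """Raw ASGI middleware sketch: track in-flight requests with try/finally."""

    def __init__(self, app, gauge, duration_histogram=None):
        self.app = app
        self.gauge = gauge
        self.duration = duration_histogram

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket straight through
            await self.app(scope, receive, send)
            return
        self.gauge.inc()
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            self.gauge.dec()  # runs even if the handler raises
            if self.duration is not None:
                self.duration.observe(time.perf_counter() - start)
```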

Note: runway_requests_in_flight only measures requests that have entered the middleware. Requests dropped by the OS TCP backlog or rejected by uvicorn's --limit-concurrency are invisible to it. See docs/request-limits.md for the full picture, including recommended production values for all three layers.
