
cane-gpu-perf

GPU inference benchmarking with opinionated diagnostics and deep hardware analysis.

Quick Start

# Single benchmark
cane-perf bench --model arcee-ai/trinity-large-thinking --backend openrouter --concurrency 8 --diagnose

# Deep GPU analysis (NVML telemetry, roofline model, power/thermal, prefill/decode)
cane-perf bench --model meta-llama/Llama-3-7b --backend vllm --concurrency 8 --deep

# With custom endpoint
cane-perf bench --model my-model --backend vllm --base-url http://localhost:8000/v1/chat/completions

Deep GPU Analysis

The --deep flag enables hardware-level GPU profiling on top of standard HTTP-layer metrics. Requires an NVIDIA GPU and pip install cane-gpu-perf[gpu].

What it adds:

  • NVML telemetry collected every 100ms during benchmarks: SM utilization, memory usage, power draw, temperature, clock speeds, PCIe throughput. Per-GPU when multi-GPU.
  • Prefill vs decode separation with inter-token latency percentiles, prefill throughput (input tok/s), decode throughput (output tok/s), and time-in-phase breakdown.
  • Roofline model classifying the workload as compute-bound, memory-bandwidth-bound, or under-utilized. Includes specs for 20+ GPUs (T4 through B200, RTX consumer cards).
  • Power and thermal efficiency: tokens/watt, tokens/joule, electricity cost per 1M tokens, thermal headroom, clock throttle detection.
  • Multi-GPU topology: NVLink vs PCIe interconnect detection, per-GPU utilization balance, straggler detection.
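The 100ms collection loop described above can be sketched as a background thread that polls a sampler at a fixed interval and summarizes the series afterward. This is a minimal illustration, not the package's actual API: the real collector wraps nvidia-ml-py (NVML) calls, but the sampler here is a pluggable function so the sketch runs without a GPU.

```python
import threading
import time
from statistics import mean
from typing import Callable, Dict, List


class TelemetryCollector:
    """Polls a sampler function on a background thread (default every 100ms)."""

    def __init__(self, sample_fn: Callable[[], Dict[str, float]], interval_s: float = 0.1):
        self.sample_fn = sample_fn  # in the real collector this wraps NVML queries
        self.interval_s = interval_s
        self.samples: List[Dict[str, float]] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            self.samples.append(self.sample_fn())
            self._stop.wait(self.interval_s)  # sleep, but wake early on stop

    def __enter__(self) -> "TelemetryCollector":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()

    def summary(self, key: str) -> Dict[str, float]:
        """Mean / min / p95 over all collected samples for one metric."""
        vals = sorted(s[key] for s in self.samples)
        p95 = vals[min(len(vals) - 1, int(0.95 * len(vals)))]
        return {"mean": mean(vals), "min": vals[0], "p95": p95}


# Fake sampler standing in for NVML SM-utilization queries:
fake = iter([45, 70, 72, 89, 85])
with TelemetryCollector(lambda: {"sm_util": next(fake, 85)}, interval_s=0.01) as c:
    time.sleep(0.06)  # pretend a benchmark runs here
print(c.summary("sm_util"))
```

The context-manager shape matters: sampling brackets the benchmark exactly, so idle time before and after the run does not dilute the utilization statistics.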
cane-perf bench --model meta-llama/Llama-3-7b --backend vllm --deep

Output:

Results:
  Requests:  100 total, 0 failed
  Latency:   p50=820ms  p95=1450ms  p99=2100ms
  TTFT:      p50=95ms   p95=180ms   p99=310ms
  Throughput: 142.3 tok/s aggregate, 28.5 tok/s mean per-request

Phase Analysis:
  Prefill:   1052 tok/s  (12% of request time)
  Decode:    31 tok/s  ITL p50=32ms p95=48ms p99=71ms

GPU 0: NVIDIA A100-SXM4-80GB
  Utilization    72%          min=45% p95=89%
  Memory         42.3 / 80.0 GB    peak=42.3GB mean=41.8GB
  Temperature    67C peak     mean=64C
  Power          287W mean    peak=312W limit=400W
  SM Clock       1410 MHz mean     min=1380 max=1410 MHz

Efficiency:
  Energy:    0.50 tok/W  0.0014 tok/J
  Power cost: $0.0561/1M tokens

## Findings

INFO: Memory-bandwidth-bound (62% of 2039 GB/s)
   Decode phase uses 62% of peak bandwidth. Expected for autoregressive generation.
   -> Quantize weights (INT8/INT4) or enable speculative decoding.
   Expected impact: INT8: ~2x decode throughput. INT4: ~4x. Speculative decoding: 2-3x.

INFO: Prefill 1052 tok/s vs decode 31 tok/s (33.9x)
   Prefill processes tokens in parallel (compute-bound), decode is sequential (memory-bound).
   -> Focus optimization on whichever phase dominates your workload.
   Expected impact: Targeted optimization based on workload profile

INFO: Energy: 0.50 tok/W, 0.0014 tok/J
   Power draw: 287W mean / 312W peak (limit: 400W, 28% headroom).
   -> Compare across quantization levels: INT8 typically doubles tok/W.
   Expected impact: Quantization: ~2x energy efficiency for decode
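The efficiency figures in the sample output reduce to straightforward arithmetic. A minimal sketch, assuming a $0.10/kWh electricity rate (the rate the tool itself uses is not stated here):

```python
def energy_efficiency(tok_per_s: float, watts: float, usd_per_kwh: float = 0.10):
    """Derive tokens-per-watt and electricity cost per 1M tokens."""
    tok_per_watt = tok_per_s / watts                  # throughput per watt of draw
    hours_per_mtok = 1e6 / tok_per_s / 3600           # wall time to emit 1M tokens
    kwh_per_mtok = watts / 1000 * hours_per_mtok      # energy consumed in that time
    usd_per_mtok = kwh_per_mtok * usd_per_kwh
    return tok_per_watt, usd_per_mtok


# Numbers from the sample run above: 142.3 tok/s aggregate at 287W mean draw.
tpw, cost = energy_efficiency(142.3, 287)
print(f"{tpw:.2f} tok/W, ${cost:.4f}/1M tokens")  # -> 0.50 tok/W, $0.0560/1M tokens
```

This reproduces the 0.50 tok/W figure and lands within rounding of the $0.0561/1M tokens shown above.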

Workload Analysis

Run realistic workload scenarios and get actionable findings:

cane-perf analyze --model arcee-ai/trinity-large-thinking --backend vllm --scenario chatbot

# With deep GPU analysis across all scenario phases
cane-perf analyze --model meta-llama/Llama-3-7b --backend vllm --scenario chatbot --deep

Output:

## Findings

CRITICAL: GPU severely under-utilized (38%)
   GPU is idle more than half the time, waiting for data.
   -> Increase batch size or concurrency. Try --concurrency 8.
   Expected impact: 2-4x throughput improvement

WARNING: High TTFT variance (p99/p50 = 6.2x)
   Some requests take 6x longer than median.
   -> Investigate cold starts or KV cache eviction.
   Expected impact: More predictable user experience

INFO: Pareto-optimal configs: vllm, sglang
   Out of 4 configs, 2 are on the Pareto frontier.
   -> Choose vllm for latency, sglang for structured output.
   Expected impact: Eliminate suboptimal configurations

Available scenarios: chatbot, rag, batch, code
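The Pareto finding above treats each config as a point in (latency, throughput) space and keeps only those not beaten on both axes at once. A minimal sketch; the config names and numbers below are illustrative, not from a real run:

```python
from typing import Dict, List, Tuple


def pareto_frontier(configs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep configs where no other config has both lower latency and
    higher-or-equal throughput (i.e. nothing dominates them)."""
    frontier = []
    for name, (lat, tput) in configs.items():
        dominated = any(
            o_lat <= lat and o_tput >= tput and (o_lat, o_tput) != (lat, tput)
            for other, (o_lat, o_tput) in configs.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (latency p95 ms, aggregate tok/s) -- hypothetical measurements
configs = {
    "vllm":   (820, 142.3),
    "sglang": (910, 151.0),
    "tgi":    (1100, 130.0),  # dominated by vllm on both axes
    "ollama": (1600, 95.0),   # dominated by every other config
}
print(pareto_frontier(configs))  # -> ['vllm', 'sglang']
```

vllm and sglang survive because each wins on one axis: vllm has the lowest latency, sglang the highest throughput, so neither dominates the other.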

Scenarios

Scenario   What it simulates                              Key metric
chatbot    50 concurrent chat users, 3 load phases        TTFT p95 < 1500ms
rag        Long-context RAG pipeline (1K-16K tokens)      TTFT p95 < 5000ms
batch      Offline batch processing, concurrency sweep    Max aggregate tok/s
code       Coding assistant (autocomplete + full gen)     TTFT < 300ms (autocomplete)
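Each scenario pairs a workload shape with a pass/fail key metric, as in the table above. A minimal sketch of how such an SLA check might look; the `Scenario` fields here are illustrative, not the package's actual dataclass:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Scenario:
    name: str
    description: str
    sla_metric: str      # which measured metric the SLA applies to
    sla_limit_ms: float  # scenario passes if the metric stays under this

    def check(self, measured: Dict[str, float]) -> bool:
        """True when the measured key metric meets the scenario's SLA."""
        return measured[self.sla_metric] < self.sla_limit_ms


chatbot = Scenario("chatbot", "50 concurrent chat users, 3 load phases",
                   "ttft_p95_ms", 1500)
print(chatbot.check({"ttft_p95_ms": 180}))   # -> True
print(chatbot.check({"ttft_p95_ms": 2100}))  # -> False
```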

Diagnostic Categories

Standard diagnostics (always available with --diagnose or --deep):

Category      What it checks
throughput    GPU utilization, concurrency scaling, throughput ceiling
latency       TTFT, latency variance, p99/p50 ratio
reliability   Failure rate, error patterns
memory        GPU memory usage vs capacity
config        Concurrency settings, batching
comparison    Backend comparison, Pareto-optimal configs
scaling       Concurrency scaling efficiency

Deep diagnostics (with --deep):

Category         What it checks
roofline         Compute-bound vs memory-bandwidth-bound classification
phase_balance    Prefill vs decode time split, inter-token latency
thermal          Clock throttling, temperature headroom
efficiency       Tokens/watt, energy cost, power headroom
scaling          Multi-GPU utilization balance, interconnect type
memory_pressure  KV cache growth, OOM risk under load
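The roofline check boils down to comparing achieved memory traffic and compute against the GPU's peaks. A minimal sketch for the decode phase, assuming each decode step streams the full weight set once (shared across the batch) and ignoring KV cache traffic, which the real model also accounts for; the 30% under-utilization threshold and the example numbers are assumptions:

```python
from typing import Tuple


def classify_decode(steps_per_s: float, batch: int, params: float,
                    param_bytes: float, peak_bw_gbs: float,
                    peak_tflops: float) -> Tuple[str, float]:
    """Roofline-style classification of the decode phase."""
    # Each decode step reads the full weight set once, shared across the batch.
    bw_util = steps_per_s * param_bytes / (peak_bw_gbs * 1e9)
    # ~2 FLOPs per parameter per generated token, one token per sequence per step.
    compute_util = steps_per_s * batch * 2 * params / (peak_tflops * 1e12)
    if max(bw_util, compute_util) < 0.3:
        return "under-utilized", bw_util
    kind = "memory-bandwidth-bound" if bw_util > compute_util else "compute-bound"
    return kind, bw_util


# 7B params in FP16 (14 GB of weights) on an A100-80GB:
# 2039 GB/s peak bandwidth, ~312 TFLOPS dense FP16; 90 decode steps/s, batch of 8.
kind, bw = classify_decode(steps_per_s=90, batch=8, params=7e9,
                           param_bytes=14e9, peak_bw_gbs=2039, peak_tflops=312)
print(kind, f"{bw:.0%} of peak bandwidth")  # -> memory-bandwidth-bound 62% of peak bandwidth
```

With these numbers the weight traffic alone uses 62% of the A100's bandwidth while compute sits in single digits, matching the memory-bandwidth-bound finding in the sample output.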

Architecture

cane_gpu_perf/
  config.py            # BenchmarkConfig, BenchmarkResult dataclasses
  utils/tokens.py      # tiktoken-based token counting
  bench/runner.py      # Benchmark runner (streaming HTTP, metrics, GPU collector)
  diagnose/engine.py   # DiagnoseEngine, opinionated findings (13 standard + 7 deep checks)
  scenarios/           # Workload scenarios (chatbot, rag, batch, code)
    base.py            # Scenario dataclass
    runner.py          # ScenarioRunner (multi-phase + SLA checks)
    *.py               # Individual scenario definitions
  gpu/                 # Deep GPU analysis (requires nvidia-ml-py)
    collector.py       # NVML telemetry collector (background thread, 100ms sampling)
    roofline.py        # Roofline model (20+ GPU specs, compute vs bandwidth classification)
    efficiency.py      # Power/thermal efficiency (tok/W, tok/J, cost, throttle detection)
    topology.py        # Multi-GPU topology (NVLink vs PCIe, utilization balance)
  cli/main.py          # CLI entry point (bench, analyze commands)

Installation

# From PyPI
pip install cane-gpu-perf

# From a source checkout
pip install -e .

# With GPU telemetry support (requires an NVIDIA GPU)
pip install "cane-gpu-perf[gpu]"
