llm-grill

CLI for benchmarking LLM inference servers: vLLM, SGLang, llama.cpp, LiteLLM.

Measures TTFT, TPOT, end-to-end latency, throughput, success rate, and KV cache quality metrics, and runs load ramps (breaking-point detection) against multi-turn conversation scenarios.

llm-grill Demo


Install

Requires Python 3.11+ and uv.

uv tool install llm-grill

Verify:

llm-grill --version

Quick start

Copy the example scenario and adapt it to your setup:

cp scenarios/example.yaml scenarios/my-bench.yaml
# Edit URLs, model name, and API key

1. Check connectivity

llm-grill ping scenarios/my-bench.yaml

2. Run a benchmark

llm-grill run scenarios/my-bench.yaml --output results.jsonl

After the run, tables are printed automatically:

  • Benchmark Summary — latency, throughput, success rate per server/model
  • Conversation Quality Metrics — KV cache hit rate, turn-to-turn latency ratio, context growth factor
  • Load Ramp Results — (if ramp_levels is set) one row per (server, model, concurrency level)

3. Generate a report from an existing results file

# Terminal table (summary + conversation metrics)
llm-grill report results.jsonl

# JSON (both sections, pipeable)
llm-grill report results.jsonl --format json

# CSV (raw requests, pandas-ready)
llm-grill report results.jsonl --format csv --output summary.csv

# Hide conversation metrics table
llm-grill report results.jsonl --no-conversations

Commands

Command                              Description
llm-grill run <scenario>             Run a benchmark, stream results to JSONL
llm-grill ping <scenario>            Test server connectivity
llm-grill show-scenario <scenario>   Validate and display a scenario
llm-grill report <results.jsonl>     Generate a report from a results file

run options

Option          Default                Description
--output / -o   results-<name>.jsonl   Output file path
--format / -f   jsonl                  jsonl or csv
--quiet / -q    off                    Suppress progress and tables

report options

Option               Default   Description
--format / -f        table     table, json, or csv
--output / -o                  Output path for CSV format
--no-conversations   off       Hide the conversation metrics table

Global options

Option           Description
--verbose / -v   Enable debug logging
--version / -V   Print version and exit

Supported backends

Backend             Type       Metrics source        Notes
vLLM                vllm       Prometheus /metrics   KV cache usage
SGLang              sglang     Prometheus /metrics   Cache hit rate
llama.cpp           llamacpp   /health endpoint      GGUF models
LiteLLM             litellm    Gateway routing       Proxy for multiple backends
OpenAI-compatible   openai                           Reuses vLLM client
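
For the Prometheus-backed servers, the cache metrics above come from the standard /metrics text endpoint. A minimal sketch of scraping it directly (the cache-related metric names mentioned in the comments are assumptions; vLLM and SGLang name their gauges differently across versions):

# Sketch: scrape a Prometheus-style /metrics endpoint and keep cache-related gauges.
# Metric names such as gpu_cache_usage_perc are illustrative and vary by server/version.
from urllib.request import urlopen

def cache_metrics(base_url: str) -> dict[str, float]:
    text = urlopen(f"{base_url}/metrics", timeout=5).read().decode()
    out: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):                 # skip HELP/TYPE comment lines
            continue
        if "cache" in line:                      # e.g. gpu_cache_usage_perc, cache_hit_rate
            name, _, value = line.rpartition(" ")
            try:
                out[name] = float(value)
            except ValueError:
                pass
    return out

print(cache_metrics("http://gpu-vllm:8000"))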

Scenario format (YAML)

name: my-scenario
description: Optional description

backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    api_key: none                    # "none", a literal key, or ${ENV_VAR}
    type: vllm                       # vllm | sglang | llamacpp | litellm | openai
    timeout: 120.0

models:
  - name: devstral-small-2-24b
    max_tokens: 512
    temperature: 0.0

conversations:
  - name: multi-turn-debug
    turns:
      - role: system
        content: "You are an expert developer."
      - role: user
        content: "My FastAPI app returns 500 errors under load. What should I check?"
      - role: user
        content: "The DB connection pool is exhausted. How do I configure it in SQLAlchemy?"

targets:
  - backend: gpu-vllm
    model: devstral-small-2-24b
    conversation: multi-turn-debug

load:
  concurrent_users: 10
  iterations: 3
  ramp_up_seconds: 5.0
  think_time_seconds: 0.0

Each role: user turn triggers an inference request. Conversation history (including assistant responses) is carried forward, so the server sees a growing context.
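
Conceptually, each user turn is one request against the backend's OpenAI-compatible chat completions endpoint, with the accumulated messages list as the prompt. A rough sketch of that loop (illustrative only, not llm-grill's internal client; the endpoint path and field names follow the OpenAI-compatible convention the listed backends expose):

# Sketch: each user turn triggers a chat completions request with the full history so far.
import json
from urllib.request import Request, urlopen

def run_conversation(base_url: str, model: str, turns: list[dict]) -> list[dict]:
    messages: list[dict] = []
    for turn in turns:
        messages.append(turn)                    # history (system + user + assistant) carried forward
        if turn["role"] != "user":
            continue                             # only user turns trigger an inference request
        body = json.dumps({"model": model, "messages": messages, "max_tokens": 512}).encode()
        req = Request(f"{base_url}/v1/chat/completions", data=body,
                      headers={"Content-Type": "application/json"})
        reply = json.loads(urlopen(req).read())["choices"][0]["message"]
        messages.append(reply)                   # assistant answer grows the next turn's context
    return messages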

Load ramp

Add ramp_levels to sweep concurrency levels in a single run. When set, concurrent_users is ignored.

load:
  iterations: 3
  ramp_levels: [1, 5, 10, 20, 50, 100]
  ramp_pause_seconds: 10.0   # pause between levels, default 10 s
  think_time_seconds: 0.0

Results are tagged with concurrent_users_level in the JSONL output and displayed in a Load Ramp Results table sorted by (server, model, users).
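
To locate the breaking point from a ramp run, one option is to aggregate the JSONL per concurrency level and watch where tail latency or the success rate degrades. A short pandas sketch using the output fields documented below:

# Sketch: summarise a ramp run per concurrency level to spot the breaking point.
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
ramp = (
    df.groupby(["target_server", "target_model", "concurrent_users_level"])
      .agg(ttft_p50=("ttft_s", "median"),
           e2e_p95=("e2e_latency_s", lambda s: s.quantile(0.95)),
           success_rate=("success", "mean"))
      .reset_index()
      .sort_values(["target_server", "target_model", "concurrent_users_level"])
)
print(ramp)  # the level where e2e_p95 jumps or success_rate drops marks the breaking point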


Metrics

Latency & throughput

Metric         Description
TTFT           Time to First Token — from request sent to first token received (client-side, includes network)
TPOT           Time Per Output Token — (E2E - TTFT) / (completion_tokens - 1)
E2E latency    Total time from request to last token
tokens/s       completion_tokens / E2E latency (per request) or total across all requests / benchmark duration
success rate   % of requests completed without error

t0       → request sent
t_first  → first non-empty content chunk received
t_last   → stream ends ([DONE] or connection close)

TTFT   = t_first - t0
E2E    = t_last  - t0
TPOT   = (E2E - TTFT) / max(completion_tokens - 1, 1)

Measurement includes network round-trip. For cross-server comparisons, run from the same network location.
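
As a concrete illustration of the formulas above, a client-side measurement loop over a streaming OpenAI-compatible endpoint might look roughly like this (a simplified sketch, not llm-grill's implementation; it counts streamed content chunks rather than true tokens and skims over SSE edge cases):

# Sketch: measure TTFT / TPOT / E2E on the client side over a streaming response.
import json, time
from urllib.request import Request, urlopen

def measure(base_url: str, model: str, messages: list[dict], max_tokens: int = 512):
    body = json.dumps({"model": model, "messages": messages,
                       "max_tokens": max_tokens, "stream": True}).encode()
    req = Request(f"{base_url}/v1/chat/completions", data=body,
                  headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()                       # request sent
    t_first, chunks = None, 0
    with urlopen(req) as resp:
        for raw in resp:                           # one SSE line per streamed chunk
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            delta = json.loads(line[5:])["choices"][0]["delta"].get("content")
            if delta:
                if t_first is None:
                    t_first = time.perf_counter()  # first non-empty content chunk
                chunks += 1
    t_last = time.perf_counter()                   # stream ended
    ttft = t_first - t0
    e2e = t_last - t0
    tpot = (e2e - ttft) / max(chunks - 1, 1)
    return ttft, tpot, e2e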

Conversation quality (multi-turn)

Computed per (server, model, conversation) group:

Metric                  Description                                    Interpretation
Turn-to-Turn Ratio      mean(TTFT turn > 0) / mean(TTFT turn 0)        < 1 → KV cache reduces prefill time
Context Growth Factor   mean(E2E last turn) / mean(E2E first turn)     > 1 → latency increases with context
KV Cache Hit Rate       Prompt tokens served from cache                SGLang only (Prometheus)
KV Cache Usage          GPU KV cache capacity used                     vLLM only (Prometheus)
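
The two latency-derived ratios can be recomputed offline from a results file; a sketch with pandas, assuming turn is 0-indexed as in the formulas above:

# Sketch: recompute Turn-to-Turn Ratio and Context Growth Factor from results.jsonl.
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
ok = df[df["success"]]
for (server, model, conv), g in ok.groupby(["target_server", "target_model", "conversation"]):
    first, later = g[g["turn"] == 0], g[g["turn"] > 0]
    t2t = later["ttft_s"].mean() / first["ttft_s"].mean()                  # < 1 suggests KV cache reuse
    last = g[g["turn"] == g["turn"].max()]
    growth = last["e2e_latency_s"].mean() / first["e2e_latency_s"].mean()  # > 1: longer context, higher latency
    print(f"{server}/{model}/{conv}: turn-to-turn={t2t:.2f}, context-growth={growth:.2f}")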

GPU monitoring

Enable per-backend GPU metrics (utilization, memory, temperature, power) collected via SSH:

backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    type: vllm
    gpu_monitoring: true
    ssh_host: gpu-vllm       # defaults to URL host if omitted
    ssh_user: root            # default

Requires nvidia-smi on the target host and SSH key-based access.
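
Under the hood this kind of monitoring boils down to polling nvidia-smi over SSH; the sketch below shows the general shape of such a query (field list and parsing are illustrative, not necessarily what llm-grill runs):

# Sketch: poll GPU stats over SSH with nvidia-smi, roughly what gpu_monitoring relies on.
import subprocess

def gpu_stats(ssh_user: str, ssh_host: str) -> list[dict]:
    query = "utilization.gpu,memory.used,temperature.gpu,power.draw"
    out = subprocess.run(
        ["ssh", f"{ssh_user}@{ssh_host}",
         f"nvidia-smi --query-gpu={query} --format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    keys = ["util_pct", "mem_used_mib", "temp_c", "power_w"]
    return [dict(zip(keys, map(float, line.split(", ")))) for line in out.splitlines()]

print(gpu_stats("root", "gpu-vllm"))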


Output format (JSONL)

One JSON object per request, written incrementally:

{
  "scenario": "my-scenario",
  "target_server": "gpu-vllm",
  "target_model": "devstral-small-2-24b",
  "conversation": "multi-turn-debug",
  "turn": 1,
  "iteration": 0,
  "user_id": 3,
  "timestamp_start": "2026-03-10T14:00:00+00:00",
  "ttft_s": 0.142,
  "tpot_s": 0.018,
  "e2e_latency_s": 1.23,
  "prompt_tokens": 45,
  "completion_tokens": 64,
  "tokens_per_second": 52.0,
  "success": true,
  "error": null,
  "kv_cache_usage": 0.34,
  "requests_running": 8.0,
  "concurrent_users_level": 10
}

The file is valid even if the benchmark is interrupted — each line is a complete record.

Read with pandas:

import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
df.groupby("target_server")[["ttft_s", "e2e_latency_s", "tokens_per_second"]].mean()

Read with polars:

import polars as pl

df = pl.read_ndjson("results.jsonl")
df.group_by("target_server").agg(pl.col("ttft_s").mean())

API keys

Use ${ENV_VAR} syntax to read from environment variables at load time:

backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm

export LITELLM_API_KEY="sk-..."
llm-grill run scenarios/my-scenario.yaml

Never commit literal API keys in scenario files.
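
Conceptually, ${ENV_VAR} expansion is a lookup against the process environment when the scenario is parsed; a sketch of that behaviour (an illustration, not necessarily how llm-grill implements it):

# Sketch: expand a ${ENV_VAR} placeholder when the scenario file is loaded.
import os
import re

def expand_api_key(value: str) -> str:
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, value)

print(expand_api_key("${LITELLM_API_KEY}"))    # -> "sk-..." once the variable is exported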


LiteLLM gateway routing

When backends are behind a LiteLLM proxy, define one backend entry for the gateway and use model aliases to route:

backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm

models:
  - name: devstral-small-llama    # LiteLLM alias → llama.cpp
    max_tokens: 512
  - name: devstral-small-vllm     # LiteLLM alias → vLLM
    max_tokens: 512

targets:
  - backend: gateway
    model: devstral-small-llama
    conversation: short-code-question
  - backend: gateway
    model: devstral-small-vllm
    conversation: short-code-question

Aliases must match model_name values in LiteLLM's config.yaml.


Troubleshooting

Problem                            Fix
ModuleNotFoundError: llm_grill     Run make install
ValidationError on scenario load   Run llm-grill show-scenario file.yaml for details
TTFT always < 1 ms                 Server not streaming — check stream: true support
All requests connection refused    Run llm-grill ping file.yaml — check URL/port
401 Unauthorized                   Set api_key: ${MY_VAR} and export the variable
ping times out on LiteLLM          LiteLLM /health does live inference — check gateway URL

Contributing

See CONTRIBUTING.md.

License

Apache 2.0 — see LICENSE.
