
llm-grill


CLI for benchmarking LLM inference servers: vLLM, SGLang, llama.cpp, LiteLLM.

Measures TTFT, TPOT, end-to-end latency, throughput, success rate, KV cache quality metrics, and load ramp (breaking-point detection) on multi-turn conversation scenarios.

llm-grill Demo


Install

Requires Python 3.11+ and uv.

uv tool install llm-grill

Verify:

llm-grill --version

Quick start

Copy the example scenario and adapt it to your setup:

cp scenarios/example.yaml scenarios/my-bench.yaml
# Edit URLs, model name, and API key

1. Check connectivity

llm-grill ping scenarios/my-bench.yaml

2. Run a benchmark

llm-grill run scenarios/my-bench.yaml --output results.jsonl

After the run, tables are printed automatically:

  • Benchmark Summary — latency, throughput, success rate per server/model
  • Conversation Quality Metrics — KV cache hit rate, turn-to-turn latency ratio, context growth factor
  • Load Ramp Results — (if ramp_levels is set) one row per (server, model, concurrency level)

3. Generate a report from an existing results file

# Terminal table (summary + conversation metrics)
llm-grill report results.jsonl

# JSON (both sections, pipeable)
llm-grill report results.jsonl --format json

# CSV (raw requests, pandas-ready)
llm-grill report results.jsonl --format csv --output summary.csv

# Hide conversation metrics table
llm-grill report results.jsonl --no-conversations

Commands

| Command | Description |
|---|---|
| `llm-grill run <scenario>` | Run a benchmark, stream results to JSONL |
| `llm-grill ping <scenario>` | Test server connectivity |
| `llm-grill show-scenario <scenario>` | Validate and display a scenario |
| `llm-grill report <results.jsonl>` | Generate a report from a results file |

run options

| Option | Default | Description |
|---|---|---|
| `--output` / `-o` | `results-<name>.jsonl` | Output file path |
| `--format` / `-f` | `jsonl` | `jsonl` or `csv` |
| `--quiet` / `-q` | off | Suppress progress and tables |

report options

| Option | Default | Description |
|---|---|---|
| `--format` / `-f` | `table` | `table`, `json`, or `csv` |
| `--output` / `-o` | (none) | Output path for CSV format |
| `--no-conversations` | off | Hide the conversation metrics table |

Global options

| Option | Description |
|---|---|
| `--verbose` / `-v` | Enable debug logging |
| `--version` / `-V` | Print version and exit |

Supported backends

| Backend | Type | Metrics source | Notes |
|---|---|---|---|
| vLLM | `vllm` | Prometheus `/metrics` | KV cache usage |
| SGLang | `sglang` | Prometheus `/metrics` | Cache hit rate |
| llama.cpp | `llamacpp` | `/health` endpoint | GGUF models |
| LiteLLM | `litellm` | Gateway routing | Proxy for multiple backends |
| OpenAI-compatible | `openai` | (none) | Reuses vLLM client |

Scenario format (YAML)

name: my-scenario
description: Optional description

backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    api_key: none                    # "none", a literal key, or ${ENV_VAR}
    type: vllm                       # vllm | sglang | llamacpp | litellm | openai
    timeout: 120.0

models:
  - name: devstral-small-2-24b
    max_tokens: 512
    temperature: 0.0

conversations:
  - name: multi-turn-debug
    turns:
      - role: system
        content: "You are an expert developer."
      - role: user
        content: "My FastAPI app returns 500 errors under load. What should I check?"
      - role: user
        content: "The DB connection pool is exhausted. How do I configure it in SQLAlchemy?"

targets:
  - backend: gpu-vllm
    model: devstral-small-2-24b
    conversation: multi-turn-debug

load:
  concurrent_users: 10
  iterations: 3
  ramp_up_seconds: 5.0
  think_time_seconds: 0.0

Each role: user turn triggers an inference request. Conversation history (including assistant responses) is carried forward, so the server sees a growing context.
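As a sketch of this expansion (a hypothetical helper, not the actual llm-grill code), each `user` turn yields one request whose `messages` payload carries everything seen so far, with a placeholder standing in for the assistant replies that a real run would append:

```python
def build_turn_requests(turns):
    """Yield (turn_index, messages) for every user turn, carrying history."""
    history = []
    requests = []
    turn_idx = 0
    for turn in turns:
        history.append(dict(turn))
        if turn["role"] == "user":
            requests.append((turn_idx, list(history)))
            # In a real run, the server's reply would be appended here
            # before the next user turn, growing the context further.
            history.append({"role": "assistant", "content": "<reply>"})
            turn_idx += 1
    return requests

turns = [
    {"role": "system", "content": "You are an expert developer."},
    {"role": "user", "content": "My FastAPI app returns 500 errors under load."},
    {"role": "user", "content": "The DB connection pool is exhausted."},
]
for idx, messages in build_turn_requests(turns):
    print(idx, len(messages))  # → 0 2, then 1 4
```

Turn 0 sends two messages (system + user); turn 1 sends four, because the first exchange rides along.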

Load ramp

Add ramp_levels to sweep concurrency levels in a single run. When set, concurrent_users is ignored.

load:
  iterations: 3
  ramp_levels: [1, 5, 10, 20, 50, 100]
  ramp_pause_seconds: 10.0   # pause between levels, default 10 s
  think_time_seconds: 0.0

Results are tagged with concurrent_users_level in the JSONL output and displayed in a Load Ramp Results table sorted by (server, model, users).
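A quick way to eyeball the breaking point from that output is to group by the ramp level and watch where latency climbs or the success rate collapses. The column names follow the JSONL example later in this README; the two inline records here are synthetic stand-ins for a real `results.jsonl`:

```python
import io

import pandas as pd

# Two fake records in the JSONL shape llm-grill emits (synthetic data).
jsonl = io.StringIO(
    '{"target_server":"gpu-vllm","target_model":"m","concurrent_users_level":1,'
    '"ttft_s":0.1,"e2e_latency_s":1.0,"success":true}\n'
    '{"target_server":"gpu-vllm","target_model":"m","concurrent_users_level":50,'
    '"ttft_s":0.9,"e2e_latency_s":8.0,"success":false}\n'
)
df = pd.read_json(jsonl, lines=True)  # swap in "results.jsonl" for a real run

ramp = (
    df.groupby(["target_server", "target_model", "concurrent_users_level"])
      .agg(ttft_p50=("ttft_s", "median"), success_rate=("success", "mean"))
      .reset_index()
      .sort_values("concurrent_users_level")
)
print(ramp)
```

The first concurrency level where `success_rate` drops or `ttft_p50` jumps is a reasonable estimate of the breaking point.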


Metrics

Latency & throughput

| Metric | Description |
|---|---|
| TTFT | Time to First Token: from request sent to first token received (client-side, includes network) |
| TPOT | Time Per Output Token: `(E2E - TTFT) / (completion_tokens - 1)` |
| E2E latency | Total time from request sent to last token received |
| tokens/s | Per request: `completion_tokens / E2E latency`; aggregate: total completion tokens across all requests divided by benchmark duration |
| Success rate | Percentage of requests completed without error |

t0       → request sent
t_first  → first non-empty content chunk received
t_last   → stream ends ([DONE] or connection close)

TTFT   = t_first - t0
E2E    = t_last  - t0
TPOT   = (E2E - TTFT) / max(completion_tokens - 1, 1)

Measurement includes network round-trip. For cross-server comparisons, run from the same network location.
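The timing formulas above reduce to a few lines; this helper mirrors the README's definitions (it is an illustration, not the library's internal code). The `max(completion_tokens - 1, 1)` guard avoids division by zero on single-token completions:

```python
def derive_metrics(t0, t_first, t_last, completion_tokens):
    """Derive TTFT, E2E, and TPOT from raw timestamps (seconds)."""
    ttft = t_first - t0
    e2e = t_last - t0
    tpot = (e2e - ttft) / max(completion_tokens - 1, 1)
    return {"ttft_s": ttft, "e2e_latency_s": e2e, "tpot_s": tpot}

# Matches the sample record later in this README:
m = derive_metrics(t0=0.0, t_first=0.142, t_last=1.276, completion_tokens=64)
print(m)  # tpot ≈ (1.276 - 0.142) / 63 ≈ 0.018
```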

Conversation quality (multi-turn)

Computed per (server, model, conversation) group:

| Metric | Description | Interpretation |
|---|---|---|
| Turn-to-Turn Ratio | `mean(TTFT, turn > 0) / mean(TTFT, turn 0)` | < 1 → KV cache reduces prefill time |
| Context Growth Factor | `mean(E2E, last turn) / mean(E2E, first turn)` | > 1 → latency increases with context |
| KV Cache Hit Rate | Prompt tokens served from cache | SGLang only (Prometheus) |
| KV Cache Usage | GPU KV cache capacity used | vLLM only (Prometheus) |
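The two ratio metrics can be recomputed directly from the per-request records. A minimal sketch with synthetic data (a real computation would also group by server, model, and conversation name):

```python
import pandas as pd

# Two iterations of a 3-turn conversation (synthetic timings).
df = pd.DataFrame({
    "turn":          [0, 1, 2, 0, 1, 2],
    "ttft_s":        [0.40, 0.12, 0.10, 0.42, 0.14, 0.12],
    "e2e_latency_s": [1.00, 1.30, 1.80, 1.10, 1.40, 1.90],
})

first = df[df["turn"] == 0]
later = df[df["turn"] > 0]
last = df[df["turn"] == df["turn"].max()]

ttr = later["ttft_s"].mean() / first["ttft_s"].mean()          # Turn-to-Turn Ratio
cgf = last["e2e_latency_s"].mean() / first["e2e_latency_s"].mean()  # Context Growth Factor
print(round(ttr, 2), round(cgf, 2))  # → 0.29 1.76
```

Here TTFT drops sharply after turn 0 (ratio well below 1, consistent with prefix caching) while E2E grows with the accumulating context (factor above 1).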

GPU monitoring

Enable per-backend GPU metrics (utilization, memory, temperature, power) collected via SSH:

backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    type: vllm
    gpu_monitoring: true
    ssh_host: gpu-vllm       # defaults to URL host if omitted
    ssh_user: root            # default

Requires nvidia-smi on the target host and SSH key-based access.
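For a sense of what such a sample looks like: a collector of this kind would plausibly run `nvidia-smi` in CSV query mode over SSH and parse one line per GPU. The `--query-gpu` fields below are real `nvidia-smi` options; the exact command llm-grill runs is an assumption, and `parse_gpu_csv` is a hypothetical helper:

```python
# Assumed collection command (run on the target host over SSH):
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw \
#              --format=csv,noheader,nounits

def parse_gpu_csv(line):
    """Parse one nounits CSV line from nvidia-smi into typed fields."""
    util, mem, temp, power = [v.strip() for v in line.split(",")]
    return {"util_pct": int(util), "mem_mib": int(mem),
            "temp_c": int(temp), "power_w": float(power)}

sample = "87, 21504, 66, 289.45"
print(parse_gpu_csv(sample))
```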


Output format (JSONL)

One JSON object per request, written incrementally:

{
  "scenario": "my-scenario",
  "target_server": "gpu-vllm",
  "target_model": "devstral-small-2-24b",
  "conversation": "multi-turn-debug",
  "turn": 1,
  "iteration": 0,
  "user_id": 3,
  "timestamp_start": "2026-03-10T14:00:00+00:00",
  "ttft_s": 0.142,
  "tpot_s": 0.018,
  "e2e_latency_s": 1.23,
  "prompt_tokens": 45,
  "completion_tokens": 64,
  "tokens_per_second": 52.0,
  "success": true,
  "error": null,
  "kv_cache_usage": 0.34,
  "requests_running": 8.0,
  "concurrent_users_level": 10
}

The file is valid even if the benchmark is interrupted — each line is a complete record.

Read with pandas:

df = pd.read_json("results.jsonl", lines=True)
df.groupby("target_server")[["ttft_s", "e2e_latency_s", "tokens_per_second"]].mean()

Read with polars:

df = pl.read_ndjson("results.jsonl")
df.group_by("target_server").agg(pl.col("ttft_s").mean())

API keys

Use ${ENV_VAR} syntax to read from environment variables at load time:

backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm

export LITELLM_API_KEY="sk-..."
llm-grill run scenarios/my-scenario.yaml

Never commit literal API keys in scenario files.
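One plausible way this resolution works (a sketch under assumptions, not the actual llm-grill code): expand `${VAR}` at load time and fail loudly if the variable is unset, while passing `none` and literal keys through unchanged:

```python
import os
import re

def resolve_api_key(value, env=os.environ):
    """Expand a ${VAR} reference from the environment; pass literals through."""
    m = re.fullmatch(r"\$\{(\w+)\}", value)
    if not m:
        return value  # "none" or a literal key
    name = m.group(1)
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]

print(resolve_api_key("${LITELLM_API_KEY}", env={"LITELLM_API_KEY": "sk-test"}))  # → sk-test
```

Failing at load time rather than at request time surfaces a missing key before any benchmark traffic is sent.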


LiteLLM gateway routing

When backends are behind a LiteLLM proxy, define one backend entry for the gateway and use model aliases to route:

backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm

models:
  - name: devstral-small-llama    # LiteLLM alias → llama.cpp
    max_tokens: 512
  - name: devstral-small-vllm     # LiteLLM alias → vLLM
    max_tokens: 512

targets:
  - backend: gateway
    model: devstral-small-llama
    conversation: short-code-question
  - backend: gateway
    model: devstral-small-vllm
    conversation: short-code-question

Aliases must match model_name values in LiteLLM's config.yaml.


Troubleshooting

| Problem | Fix |
|---|---|
| `ModuleNotFoundError: llm_grill` | Run `make install` |
| `ValidationError` on scenario load | Run `llm-grill show-scenario file.yaml` for details |
| TTFT always < 1 ms | Server not streaming: check `stream: true` support |
| All requests `connection refused` | Run `llm-grill ping file.yaml` and check URL/port |
| `401 Unauthorized` | Set `api_key: ${MY_VAR}` and export the variable |
| `ping` times out on LiteLLM | LiteLLM `/health` does live inference; check gateway URL |

Contributing

See CONTRIBUTING.md.

License

Apache 2.0 — see LICENSE.
