# llm-grill
CLI for benchmarking LLM inference servers: vLLM, SGLang, llama.cpp, LiteLLM.
Measures TTFT, TPOT, end-to-end latency, throughput, success rate, KV cache quality metrics, and load ramp (breaking-point detection) on multi-turn conversation scenarios.
## Install

Requires Python 3.11+ and uv.

```bash
uv tool install llm-grill
```

Verify:

```bash
llm-grill --version
```
## Quick start

Copy the example scenario and adapt it to your setup:

```bash
cp scenarios/example.yaml scenarios/my-bench.yaml
# Edit URLs, model name, and API key
```

### 1. Check connectivity

```bash
llm-grill ping scenarios/my-bench.yaml
```

### 2. Run a benchmark

```bash
llm-grill run scenarios/my-bench.yaml --output results.jsonl
```

After the run, tables are printed automatically:

- Benchmark Summary — latency, throughput, success rate per server/model
- Conversation Quality Metrics — KV cache hit rate, turn-to-turn latency ratio, context growth factor
- Load Ramp Results — one row per (server, model, concurrency level); shown only if `ramp_levels` is set

### 3. Generate a report from an existing results file

```bash
# Terminal table (summary + conversation metrics)
llm-grill report results.jsonl

# JSON (both sections, pipeable)
llm-grill report results.jsonl --format json

# CSV (raw requests, pandas-ready)
llm-grill report results.jsonl --format csv --output summary.csv

# Hide conversation metrics table
llm-grill report results.jsonl --no-conversations
```
## Commands

| Command | Description |
|---|---|
| `llm-grill run <scenario>` | Run a benchmark, stream results to JSONL |
| `llm-grill ping <scenario>` | Test server connectivity |
| `llm-grill show-scenario <scenario>` | Validate and display a scenario |
| `llm-grill report <results.jsonl>` | Generate a report from a results file |
### run options

| Option | Default | Description |
|---|---|---|
| `--output` / `-o` | `results-<name>.jsonl` | Output file path |
| `--format` / `-f` | `jsonl` | `jsonl` or `csv` |
| `--quiet` / `-q` | off | Suppress progress and tables |
### report options

| Option | Default | Description |
|---|---|---|
| `--format` / `-f` | `table` | `table`, `json`, or `csv` |
| `--output` / `-o` | — | Output path for CSV format |
| `--no-conversations` | off | Hide the conversation metrics table |
### Global options

| Option | Description |
|---|---|
| `--verbose` / `-v` | Enable debug logging |
| `--version` / `-V` | Print version and exit |
## Supported backends

| Backend | Type | Metrics source | Notes |
|---|---|---|---|
| vLLM | `vllm` | Prometheus `/metrics` | KV cache usage |
| SGLang | `sglang` | Prometheus `/metrics` | Cache hit rate |
| llama.cpp | `llamacpp` | `/health` endpoint | GGUF models |
| LiteLLM | `litellm` | Gateway routing | Proxy for multiple backends |
| OpenAI-compatible | `openai` | — | Reuses vLLM client |
## Scenario format (YAML)

```yaml
name: my-scenario
description: Optional description

backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    api_key: none        # "none", a literal key, or ${ENV_VAR}
    type: vllm           # vllm | sglang | llamacpp | litellm | openai
    timeout: 120.0

models:
  - name: devstral-small-2-24b
    max_tokens: 512
    temperature: 0.0

conversations:
  - name: multi-turn-debug
    turns:
      - role: system
        content: "You are an expert developer."
      - role: user
        content: "My FastAPI app returns 500 errors under load. What should I check?"
      - role: user
        content: "The DB connection pool is exhausted. How do I configure it in SQLAlchemy?"

targets:
  - backend: gpu-vllm
    model: devstral-small-2-24b
    conversation: multi-turn-debug

load:
  concurrent_users: 10
  iterations: 3
  ramp_up_seconds: 5.0
  think_time_seconds: 0.0
```

Each `role: user` turn triggers an inference request. Conversation history (including assistant responses) is carried forward, so the server sees a growing context.
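To make that request pattern concrete, here is a minimal sketch of replaying a multi-turn conversation against an OpenAI-compatible `/v1/chat/completions` endpoint. It is illustrative only (not llm-grill's internals); the `httpx` dependency, URL, and model name are assumptions.

```python
# Illustrative sketch: replay a multi-turn conversation against an
# OpenAI-compatible chat endpoint, carrying the growing history forward.
import httpx

turns = [
    {"role": "system", "content": "You are an expert developer."},
    {"role": "user", "content": "My FastAPI app returns 500 errors under load. What should I check?"},
    {"role": "user", "content": "The DB connection pool is exhausted. How do I configure it in SQLAlchemy?"},
]

history = []
with httpx.Client(base_url="http://gpu-vllm:8000", timeout=120.0) as client:
    for turn in turns:
        history.append(turn)
        if turn["role"] != "user":
            continue  # only user turns trigger a request
        resp = client.post(
            "/v1/chat/completions",
            json={"model": "devstral-small-2-24b", "messages": history, "max_tokens": 512},
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        # append the assistant reply, so the next request sees a longer context
        history.append({"role": "assistant", "content": answer})
```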
## Load ramp

Add `ramp_levels` to sweep concurrency levels in a single run. When set, `concurrent_users` is ignored.

```yaml
load:
  iterations: 3
  ramp_levels: [1, 5, 10, 20, 50, 100]
  ramp_pause_seconds: 10.0   # pause between levels, default 10 s
  think_time_seconds: 0.0
```

Results are tagged with `concurrent_users_level` in the JSONL output and displayed in a Load Ramp Results table sorted by (server, model, users).
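If you want to locate a breaking point from the raw JSONL yourself, a short pandas sketch like the one below works. The 0.95 success-rate threshold is an arbitrary example, not an llm-grill default.

```python
# Sketch: find the highest concurrency level that still meets an
# example success-rate threshold (0.95 is an arbitrary cut-off).
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
by_level = (
    df.groupby(["target_server", "target_model", "concurrent_users_level"])
      .agg(success_rate=("success", "mean"),
           p95_ttft_s=("ttft_s", lambda s: s.quantile(0.95)))
      .reset_index()
)
ok = by_level[by_level["success_rate"] >= 0.95]
print(ok.groupby(["target_server", "target_model"])["concurrent_users_level"].max())
```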
## Metrics

### Latency & throughput

| Metric | Description |
|---|---|
| TTFT | Time to First Token — from request sent to first token received (client-side, includes network) |
| TPOT | Time Per Output Token — `(E2E - TTFT) / (completion_tokens - 1)` |
| E2E latency | Total time from request to last token |
| tokens/s | `completion_tokens / E2E latency` per request, or total tokens across all requests / benchmark duration |
| success rate | % of requests completed without error |

```
t0      → request sent
t_first → first non-empty content chunk received
t_last  → stream ends ([DONE] or connection close)

TTFT = t_first - t0
E2E  = t_last - t0
TPOT = (E2E - TTFT) / max(completion_tokens - 1, 1)
```

Measurement includes the network round-trip. For cross-server comparisons, run from the same network location.
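As an illustration of how these timestamps can be captured client-side, the sketch below measures TTFT and TPOT against any OpenAI-compatible streaming endpoint. It is not llm-grill's own client; the endpoint URL and model name are placeholders, and the chunk count only approximates `completion_tokens`.

```python
# Sketch: measure TTFT/TPOT from a streaming OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://gpu-vllm:8000/v1", api_key="none")

t0 = time.perf_counter()
t_first = None
chunks = 0
stream = client.chat.completions.create(
    model="devstral-small-2-24b",
    messages=[{"role": "user", "content": "Explain KV cache reuse in one paragraph."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    if chunk.choices[0].delta.content:
        if t_first is None:
            t_first = time.perf_counter()  # first non-empty content chunk
        chunks += 1
t_last = time.perf_counter()

ttft = t_first - t0
e2e = t_last - t0
tpot = (e2e - ttft) / max(chunks - 1, 1)  # chunk count approximates completion_tokens
print(f"TTFT={ttft:.3f}s  E2E={e2e:.3f}s  TPOT={tpot:.4f}s")
```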
### Conversation quality (multi-turn)

Computed per (server, model, conversation) group:

| Metric | Description | Interpretation |
|---|---|---|
| Turn-to-Turn Ratio | `mean(TTFT turn > 0) / mean(TTFT turn 0)` | < 1 → KV cache reduces prefill time |
| Context Growth Factor | `mean(E2E last turn) / mean(E2E first turn)` | > 1 → latency increases with context |
| KV Cache Hit Rate | Prompt tokens served from cache | SGLang only (Prometheus) |
| KV Cache Usage | GPU KV cache capacity used | vLLM only (Prometheus) |
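The two ratio metrics can also be recomputed directly from the JSONL output. The pandas sketch below assumes the column names shown in the output-format section and is not llm-grill's own implementation.

```python
# Sketch: recompute turn-to-turn ratio and context growth factor
# from results.jsonl, per (server, model, conversation) group.
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
ok = df[df["success"]]

def conversation_metrics(g: pd.DataFrame) -> pd.Series:
    first, last = g["turn"].min(), g["turn"].max()
    return pd.Series({
        "turn_to_turn_ratio": g.loc[g["turn"] > first, "ttft_s"].mean()
                              / g.loc[g["turn"] == first, "ttft_s"].mean(),
        "context_growth_factor": g.loc[g["turn"] == last, "e2e_latency_s"].mean()
                                 / g.loc[g["turn"] == first, "e2e_latency_s"].mean(),
    })

print(ok.groupby(["target_server", "target_model", "conversation"]).apply(conversation_metrics))
```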
## GPU monitoring

Enable per-backend GPU metrics (utilization, memory, temperature, power) collected via SSH:

```yaml
backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    type: vllm
    gpu_monitoring: true
    ssh_host: gpu-vllm   # defaults to URL host if omitted
    ssh_user: root       # default
```

Requires `nvidia-smi` on the target host and SSH key-based access.
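The exact query llm-grill issues is not documented here, but the kind of sample it collects can be reproduced with a standard `nvidia-smi` query over SSH, as in this sketch:

```python
# Sketch: poll GPU metrics over SSH with nvidia-smi (illustrative only,
# not the query llm-grill itself runs).
import subprocess

def sample_gpu(ssh_host: str, ssh_user: str = "root") -> list[dict]:
    query = "utilization.gpu,memory.used,temperature.gpu,power.draw"
    out = subprocess.run(
        ["ssh", f"{ssh_user}@{ssh_host}",
         "nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    keys = ["gpu_util_pct", "mem_used_mib", "temp_c", "power_w"]
    # one CSV line per GPU, e.g. "85, 40536, 62, 312.45"
    return [dict(zip(keys, (float(v) for v in line.split(", "))))
            for line in out.strip().splitlines()]

print(sample_gpu("gpu-vllm"))
```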
## Output format (JSONL)

One JSON object per request, written incrementally:

```json
{
  "scenario": "my-scenario",
  "target_server": "gpu-vllm",
  "target_model": "devstral-small-2-24b",
  "conversation": "multi-turn-debug",
  "turn": 1,
  "iteration": 0,
  "user_id": 3,
  "timestamp_start": "2026-03-10T14:00:00+00:00",
  "ttft_s": 0.142,
  "tpot_s": 0.018,
  "e2e_latency_s": 1.23,
  "prompt_tokens": 45,
  "completion_tokens": 64,
  "tokens_per_second": 52.0,
  "success": true,
  "error": null,
  "kv_cache_usage": 0.34,
  "requests_running": 8.0,
  "concurrent_users_level": 10
}
```
The file is valid even if the benchmark is interrupted — each line is a complete record.
Read with pandas:
```python
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
df.groupby("target_server")[["ttft_s", "e2e_latency_s", "tokens_per_second"]].mean()
```
Read with polars:
```python
import polars as pl

df = pl.read_ndjson("results.jsonl")
df.group_by("target_server").agg(pl.col("ttft_s").mean())
```
## API keys

Use `${ENV_VAR}` syntax to read from environment variables at load time:

```yaml
backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm
```

```bash
export LITELLM_API_KEY="sk-..."
llm-grill run scenarios/my-scenario.yaml
```

Never commit literal API keys in scenario files.
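For reference, the substitution semantics are the usual `${VAR}` expansion; a sketch equivalent (illustrative only, not the actual scenario loader, which may handle unset variables differently) looks like this:

```python
# Sketch of ${ENV_VAR} substitution semantics; an unset variable
# raises an error in this sketch.
import os
import re

def expand_env(value: str) -> str:
    def repl(m: re.Match) -> str:
        var = m.group(1)
        if var not in os.environ:
            raise KeyError(f"environment variable {var} is not set")
        return os.environ[var]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, value)

os.environ["LITELLM_API_KEY"] = "sk-example"
print(expand_env("${LITELLM_API_KEY}"))  # -> sk-example
```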
## LiteLLM gateway routing

When backends are behind a LiteLLM proxy, define one backend entry for the gateway and use model aliases to route:

```yaml
backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm

models:
  - name: devstral-small-llama   # LiteLLM alias → llama.cpp
    max_tokens: 512
  - name: devstral-small-vllm    # LiteLLM alias → vLLM
    max_tokens: 512

targets:
  - backend: gateway
    model: devstral-small-llama
    conversation: short-code-question
  - backend: gateway
    model: devstral-small-vllm
    conversation: short-code-question
```

Aliases must match `model_name` values in LiteLLM's `config.yaml`.
## Troubleshooting

| Problem | Fix |
|---|---|
| `ModuleNotFoundError: llm_grill` | Run `make install` |
| `ValidationError` on scenario load | Run `llm-grill show-scenario file.yaml` for details |
| TTFT always < 1 ms | Server not streaming — check `stream: true` support |
| All requests `connection refused` | Run `llm-grill ping file.yaml` — check URL/port |
| `401 Unauthorized` | Set `api_key: ${MY_VAR}` and export the variable |
| `ping` times out on LiteLLM | LiteLLM `/health` does live inference — check gateway URL |
## Contributing

See CONTRIBUTING.md.

## License

Apache 2.0 — see LICENSE.