# llm-grill
CLI for benchmarking LLM inference servers: vLLM, SGLang, llama.cpp, LiteLLM.
Measures TTFT, TPOT, end-to-end latency, throughput, success rate, KV cache quality metrics, and load ramp (breaking-point detection) on multi-turn conversation scenarios.
## Install

Requires Python 3.11+ and uv.

```bash
uv tool install llm-grill
```

Verify:

```bash
llm-grill --version
```
## Quick start

Copy the example scenario and adapt it to your setup:

```bash
cp scenarios/example.yaml scenarios/my-bench.yaml
# Edit URLs, model name, and API key
```

### 1. Check connectivity

```bash
llm-grill ping scenarios/my-bench.yaml
```

### 2. Run a benchmark

```bash
llm-grill run scenarios/my-bench.yaml --output results.jsonl
```

After the run, tables are printed automatically:

- Benchmark Summary — latency, throughput, success rate per server/model
- Conversation Quality Metrics — KV cache hit rate, turn-to-turn latency ratio, context growth factor
- Load Ramp Results — one row per (server, model, concurrency level); shown only if `ramp_levels` is set

### 3. Generate a report from an existing results file

```bash
# Terminal table (summary + conversation metrics)
llm-grill report results.jsonl

# JSON (both sections, pipeable)
llm-grill report results.jsonl --format json

# CSV (raw requests, pandas-ready)
llm-grill report results.jsonl --format csv --output summary.csv

# Hide conversation metrics table
llm-grill report results.jsonl --no-conversations
```
## Commands

| Command | Description |
|---|---|
| `llm-grill run <scenario>` | Run a benchmark, stream results to JSONL |
| `llm-grill ping <scenario>` | Test server connectivity |
| `llm-grill show-scenario <scenario>` | Validate and display a scenario |
| `llm-grill report <results.jsonl>` | Generate a report from a results file |
### run options

| Option | Default | Description |
|---|---|---|
| `--output` / `-o` | `results-<name>.jsonl` | Output file path |
| `--format` / `-f` | `jsonl` | `jsonl` or `csv` |
| `--quiet` / `-q` | off | Suppress progress and tables |
### report options

| Option | Default | Description |
|---|---|---|
| `--format` / `-f` | `table` | `table`, `json`, or `csv` |
| `--output` / `-o` | — | Output path for CSV format |
| `--no-conversations` | off | Hide the conversation metrics table |
### Global options

| Option | Description |
|---|---|
| `--verbose` / `-v` | Enable debug logging |
| `--version` / `-V` | Print version and exit |
## Supported backends

| Backend | Type | Metrics source | Notes |
|---|---|---|---|
| vLLM | `vllm` | Prometheus `/metrics` | KV cache usage |
| SGLang | `sglang` | Prometheus `/metrics` | Cache hit rate |
| llama.cpp | `llamacpp` | `/health` endpoint | GGUF models |
| LiteLLM | `litellm` | Gateway routing | Proxy for multiple backends |
| OpenAI-compatible | `openai` | — | Reuses vLLM client |
## Scenario format (YAML)

```yaml
name: my-scenario
description: Optional description

backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    api_key: none        # "none", a literal key, or ${ENV_VAR}
    type: vllm           # vllm | sglang | llamacpp | litellm | openai
    timeout: 120.0

models:
  - name: devstral-small-2-24b
    max_tokens: 512
    temperature: 0.0

conversations:
  - name: multi-turn-debug
    turns:
      - role: system
        content: "You are an expert developer."
      - role: user
        content: "My FastAPI app returns 500 errors under load. What should I check?"
      - role: user
        content: "The DB connection pool is exhausted. How do I configure it in SQLAlchemy?"

targets:
  - backend: gpu-vllm
    model: devstral-small-2-24b
    conversation: multi-turn-debug

load:
  concurrent_users: 10
  iterations: 3
  ramp_up_seconds: 5.0
  think_time_seconds: 0.0
```

Each `role: user` turn triggers an inference request. Conversation history (including assistant responses) is carried forward, so the server sees a growing context.
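To make that request pattern concrete, here is a minimal sketch of replaying a multi-turn conversation against an OpenAI-compatible `/v1/chat/completions` endpoint. It is illustrative only (not llm-grill's internals); the `httpx` dependency, URL, and model name are assumptions.

```python
# Illustrative sketch: replay a multi-turn conversation against an
# OpenAI-compatible chat endpoint, carrying the growing history forward.
import httpx

turns = [
    {"role": "system", "content": "You are an expert developer."},
    {"role": "user", "content": "My FastAPI app returns 500 errors under load. What should I check?"},
    {"role": "user", "content": "The DB connection pool is exhausted. How do I configure it in SQLAlchemy?"},
]

history = []
with httpx.Client(base_url="http://gpu-vllm:8000", timeout=120.0) as client:
    for turn in turns:
        history.append(turn)
        if turn["role"] != "user":
            continue  # only user turns trigger a request
        resp = client.post(
            "/v1/chat/completions",
            json={"model": "devstral-small-2-24b", "messages": history, "max_tokens": 512},
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        # append the assistant reply, so the next request sees a longer context
        history.append({"role": "assistant", "content": answer})
```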
## Load ramp

Add `ramp_levels` to sweep concurrency levels in a single run. When set, `concurrent_users` is ignored.

```yaml
load:
  iterations: 3
  ramp_levels: [1, 5, 10, 20, 50, 100]
  ramp_pause_seconds: 10.0   # pause between levels, default 10 s
  think_time_seconds: 0.0
```

Results are tagged with `concurrent_users_level` in the JSONL output and displayed in a Load Ramp Results table sorted by (server, model, users).
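If you want to locate a breaking point from the raw JSONL yourself, a short pandas sketch like the one below works. The 0.95 success-rate threshold is an arbitrary example, not an llm-grill default.

```python
# Sketch: find the highest concurrency level that still meets an
# example success-rate threshold (0.95 is an arbitrary cut-off).
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
by_level = (
    df.groupby(["target_server", "target_model", "concurrent_users_level"])
      .agg(success_rate=("success", "mean"),
           p95_ttft_s=("ttft_s", lambda s: s.quantile(0.95)))
      .reset_index()
)
ok = by_level[by_level["success_rate"] >= 0.95]
print(ok.groupby(["target_server", "target_model"])["concurrent_users_level"].max())
```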
## Metrics

### Latency & throughput

| Metric | Description |
|---|---|
| TTFT | Time to First Token — from request sent to first token received (client-side, includes network) |
| TPOT | Time Per Output Token — `(E2E - TTFT) / (completion_tokens - 1)` |
| E2E latency | Total time from request to last token |
| tokens/s | `completion_tokens / E2E latency` per request, or total tokens across all requests / benchmark duration |
| success rate | % of requests completed without error |

```
t0      → request sent
t_first → first non-empty content chunk received
t_last  → stream ends ([DONE] or connection close)

TTFT = t_first - t0
E2E  = t_last - t0
TPOT = (E2E - TTFT) / max(completion_tokens - 1, 1)
```

Measurement includes the network round-trip. For cross-server comparisons, run from the same network location.
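As an illustration of how these timestamps can be captured client-side, the sketch below measures TTFT and TPOT against any OpenAI-compatible streaming endpoint. It is not llm-grill's own client; the endpoint URL and model name are placeholders, and the chunk count only approximates `completion_tokens`.

```python
# Sketch: measure TTFT/TPOT from a streaming OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://gpu-vllm:8000/v1", api_key="none")

t0 = time.perf_counter()
t_first = None
chunks = 0
stream = client.chat.completions.create(
    model="devstral-small-2-24b",
    messages=[{"role": "user", "content": "Explain KV cache reuse in one paragraph."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    if chunk.choices[0].delta.content:
        if t_first is None:
            t_first = time.perf_counter()  # first non-empty content chunk
        chunks += 1
t_last = time.perf_counter()

ttft = t_first - t0
e2e = t_last - t0
tpot = (e2e - ttft) / max(chunks - 1, 1)  # chunk count approximates completion_tokens
print(f"TTFT={ttft:.3f}s  E2E={e2e:.3f}s  TPOT={tpot:.4f}s")
```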
### Conversation quality (multi-turn)

Computed per (server, model, conversation) group:

| Metric | Description | Interpretation |
|---|---|---|
| Turn-to-Turn Ratio | `mean(TTFT turn > 0) / mean(TTFT turn 0)` | < 1 → KV cache reduces prefill time |
| Context Growth Factor | `mean(E2E last turn) / mean(E2E first turn)` | > 1 → latency increases with context |
| KV Cache Hit Rate | Prompt tokens served from cache | SGLang only (Prometheus) |
| KV Cache Usage | GPU KV cache capacity used | vLLM only (Prometheus) |
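The two ratio metrics can also be recomputed directly from the JSONL output. The pandas sketch below assumes the column names shown in the output-format section and is not llm-grill's own implementation.

```python
# Sketch: recompute turn-to-turn ratio and context growth factor
# from results.jsonl, per (server, model, conversation) group.
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
ok = df[df["success"]]

def conversation_metrics(g: pd.DataFrame) -> pd.Series:
    first, last = g["turn"].min(), g["turn"].max()
    return pd.Series({
        "turn_to_turn_ratio": g.loc[g["turn"] > first, "ttft_s"].mean()
                              / g.loc[g["turn"] == first, "ttft_s"].mean(),
        "context_growth_factor": g.loc[g["turn"] == last, "e2e_latency_s"].mean()
                                 / g.loc[g["turn"] == first, "e2e_latency_s"].mean(),
    })

print(ok.groupby(["target_server", "target_model", "conversation"]).apply(conversation_metrics))
```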
## GPU monitoring

Enable per-backend GPU metrics (utilization, memory, temperature, power) collected via SSH:

```yaml
backends:
  - name: gpu-vllm
    url: http://gpu-vllm:8000
    type: vllm
    gpu_monitoring: true
    ssh_host: gpu-vllm   # defaults to URL host if omitted
    ssh_user: root       # default
```

Requires `nvidia-smi` on the target host and SSH key-based access.
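The exact query llm-grill issues is not documented here, but the kind of sample it collects can be reproduced with a standard `nvidia-smi` query over SSH, as in this sketch:

```python
# Sketch: poll GPU metrics over SSH with nvidia-smi (illustrative only,
# not the query llm-grill itself runs).
import subprocess

def sample_gpu(ssh_host: str, ssh_user: str = "root") -> list[dict]:
    query = "utilization.gpu,memory.used,temperature.gpu,power.draw"
    out = subprocess.run(
        ["ssh", f"{ssh_user}@{ssh_host}",
         "nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    keys = ["gpu_util_pct", "mem_used_mib", "temp_c", "power_w"]
    # one CSV line per GPU, e.g. "85, 40536, 62, 312.45"
    return [dict(zip(keys, (float(v) for v in line.split(", "))))
            for line in out.strip().splitlines()]

print(sample_gpu("gpu-vllm"))
```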
## Output format (JSONL)

One JSON object per request, written incrementally:

```json
{
  "scenario": "my-scenario",
  "target_server": "gpu-vllm",
  "target_model": "devstral-small-2-24b",
  "conversation": "multi-turn-debug",
  "turn": 1,
  "iteration": 0,
  "user_id": 3,
  "timestamp_start": "2026-03-10T14:00:00+00:00",
  "ttft_s": 0.142,
  "tpot_s": 0.018,
  "e2e_latency_s": 1.23,
  "prompt_tokens": 45,
  "completion_tokens": 64,
  "tokens_per_second": 52.0,
  "success": true,
  "error": null,
  "kv_cache_usage": 0.34,
  "requests_running": 8.0,
  "concurrent_users_level": 10
}
```
The file is valid even if the benchmark is interrupted — each line is a complete record.
Read with pandas:
```python
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)
df.groupby("target_server")[["ttft_s", "e2e_latency_s", "tokens_per_second"]].mean()
```
Read with polars:
```python
import polars as pl

df = pl.read_ndjson("results.jsonl")
df.group_by("target_server").agg(pl.col("ttft_s").mean())
```
## API keys

Use `${ENV_VAR}` syntax to read from environment variables at load time:

```yaml
backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm
```

```bash
export LITELLM_API_KEY="sk-..."
llm-grill run scenarios/my-scenario.yaml
```

Never commit literal API keys in scenario files.
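For reference, the substitution semantics are the usual `${VAR}` expansion; a sketch equivalent (illustrative only, not the actual scenario loader, which may handle unset variables differently) looks like this:

```python
# Sketch of ${ENV_VAR} substitution semantics; an unset variable
# raises an error in this sketch.
import os
import re

def expand_env(value: str) -> str:
    def repl(m: re.Match) -> str:
        var = m.group(1)
        if var not in os.environ:
            raise KeyError(f"environment variable {var} is not set")
        return os.environ[var]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", repl, value)

os.environ["LITELLM_API_KEY"] = "sk-example"
print(expand_env("${LITELLM_API_KEY}"))  # -> sk-example
```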
## LiteLLM gateway routing

When backends are behind a LiteLLM proxy, define one backend entry for the gateway and use model aliases to route:

```yaml
backends:
  - name: gateway
    url: http://my-litellm-proxy:4000
    api_key: ${LITELLM_API_KEY}
    type: litellm

models:
  - name: devstral-small-llama   # LiteLLM alias → llama.cpp
    max_tokens: 512
  - name: devstral-small-vllm    # LiteLLM alias → vLLM
    max_tokens: 512

targets:
  - backend: gateway
    model: devstral-small-llama
    conversation: short-code-question
  - backend: gateway
    model: devstral-small-vllm
    conversation: short-code-question
```

Aliases must match `model_name` values in LiteLLM's `config.yaml`.
## Troubleshooting

| Problem | Fix |
|---|---|
| `ModuleNotFoundError: llm_grill` | Run `make install` |
| `ValidationError` on scenario load | Run `llm-grill show-scenario file.yaml` for details |
| TTFT always < 1 ms | Server not streaming — check `stream: true` support |
| All requests `connection refused` | Run `llm-grill ping file.yaml` — check URL/port |
| `401 Unauthorized` | Set `api_key: ${MY_VAR}` and export the variable |
| `ping` times out on LiteLLM | LiteLLM `/health` does live inference — check gateway URL |
## Contributing

See CONTRIBUTING.md.

## License

Apache 2.0 — see LICENSE.