ClawPerfBench

Performance benchmarking tool for LLM Serving backends with multi-turn long-context workloads.

Chinese documentation

Built on EvalScope's perf infrastructure, adding:

Multi-turn context model: System Prefix + User Prefix + History + Current Input
Append-mode compaction: Clear history, grow user prefix when context reaches limits
User arrival scheduling: Burst, steady, or Poisson arrival patterns
System metrics polling: Prometheus endpoint support for vLLM, SGLang, MindIE
Per-user + per-turn metrics: TTFT, TPOT, ITL with compaction tracking
Prefix cache simulation: Trie-based HBM + external prefix cache hit rate tracking in mock server

[Figure: ClawPerf benchmark output]

Installation

pip install clawperf

For the mock server used in testing:

pip install clawperf[mock-server]

For development:

pip install clawperf[dev]

Install from source (recommended for development):

git clone https://github.com/Potterluo/ClawPerf.git
cd ClawPerf
uv sync --extra dev --extra mock-server

Quick Start

Run a benchmark

clawperf \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model qwen3-32b \
  --num-users 5 \
  --user-arrival steady:2 \
  --max-turns 10 \
  --output results.json

Start mock server (for testing)

clawperf-mock-server --port 8080

End-to-end test with mock server

# Start mock server
clawperf-mock-server --port 8080

# Run benchmark against it
clawperf \
  --endpoint http://localhost:8080/v1/chat/completions \
  --model Qwen/Qwen2.5-7B-Instruct \
  --tokenizer Qwen/Qwen2.5-7B-Instruct \
  --num-users 4 \
  --max-turns 5 \
  --max-context-tokens 200000 \
  --metrics-endpoint http://localhost:8080/metrics \
  --backend vllm \
  --verbose

Hit-rate test mode (controlled prefix-cache hit rate)

Instead of a multi-turn scenario, run a controlled prefix-cache hit-rate test: specify input/output length and a target hit rate, and ClawPerf constructs prompts with a known shared-prefix / unique-suffix split, prefills the prefixes, then measures the actual hit rate from the server's Prometheus counters.

# --hit-rate 0.5 targets a 50% hit rate (alternatively, pass --prefix-len 512).
# --prefix-num 10 creates 10 distinct prefixes, so 10 requests reuse each one.
clawperf --mode hitrate \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model qwen3-32b --tokenizer qwen3-32b \
  --num-requests 100 --input-len 1024 --output-len 128 \
  --hit-rate 0.5 \
  --prefix-num 10 \
  --concurrency 20 \
  --metrics-endpoint http://localhost:8000/metrics --backend vllm \
  --reset-cache

How it works (borrowed from aisbench / vLLM prefix_repetition):

Each request = [shared prefix] + [3 boundary tokens] + [unique suffix]. The boundary tokens force the cache to stop at exactly prefix_len, so the hit is precisely the shared portion.
--prefix-num distinct prefixes are assigned round-robin and shuffled so reuse happens under concurrency (not back-to-back duplicates).
--prefill (default on) injects each distinct prefix with output_len=1 before measuring, so even the first request per prefix hits.
The summary prints TARGET vs MEASURED hit rate (measured from vllm:prefix_cache_hits_total/queries_total deltas), per-engine breakdown, and TTFT/TPOT percentiles.

--hit-rate (fraction) and --prefix-len (absolute) are mutually exclusive; one derives the other from --input-len.
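The prompt layout described above can be sketched in a few lines of Python. This is a minimal illustration, not ClawPerf's actual code: the function name `build_hitrate_requests`, the `vocab_size` and `seed` parameters, and the choice to count the 3 boundary tokens inside `--input-len` are all assumptions made for the example.

```python
import random

BOUNDARY_TOKENS = 3  # boundary length taken from the description above


def build_hitrate_requests(num_requests, input_len, hit_rate, prefix_num,
                           vocab_size=32000, seed=0):
    """Build token-id prompts of the form [shared prefix] + [boundary] + [unique suffix].

    Hypothetical helper: token ids are random integers standing in for real tokens.
    """
    rng = random.Random(seed)
    prefix_len = round(hit_rate * input_len)  # --hit-rate -> --prefix-len
    suffix_len = input_len - prefix_len - BOUNDARY_TOKENS
    if suffix_len < 0:
        raise ValueError("input_len is too small for the requested hit rate")

    # One random token sequence per distinct prefix.
    prefixes = [[rng.randrange(vocab_size) for _ in range(prefix_len)]
                for _ in range(prefix_num)]

    requests = []
    for i in range(num_requests):
        prefix = prefixes[i % prefix_num]  # round-robin assignment
        boundary = [rng.randrange(vocab_size) for _ in range(BOUNDARY_TOKENS)]
        suffix = [rng.randrange(vocab_size) for _ in range(suffix_len)]
        requests.append(prefix + boundary + suffix)

    rng.shuffle(requests)  # avoid back-to-back duplicates of the same prefix
    return prefix_len, requests


prefix_len, reqs = build_hitrate_requests(
    num_requests=100, input_len=1024, hit_rate=0.5, prefix_num=10)
print(prefix_len, len(reqs), len(reqs[0]))  # prints: 512 100 1024
```

With `--input-len 1024`, a hit rate of 0.5 and a prefix length of 512 describe the same split, which is why only one of the two options needs to be given.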

SLO capacity sweep mode (find max concurrent users)

Specify TTFT/TPOT SLO targets; ClawPerf sweeps concurrency (closed-loop, each user sends back-to-back multi-turn requests) and finds the max users the system can sustain while meeting the SLO. Reuses the scenario workload (system/user prefix, input/output per turn, max_context, compaction).

# --slo-ttft-ms / --slo-tpot-ms: the chosen percentile must be ≤ these values.
# --slo-percentile 0.99 checks P99 (or use 0.95/0.90).
# --slo-step-strategy geometric doubles the user count each step (or use linear).
clawperf --mode slo \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model qwen3-32b --tokenizer qwen3-32b \
  --slo-ttft-ms 500 --slo-tpot-ms 30 \
  --slo-percentile 0.99 \
  --slo-min-users 1 --slo-max-users 200 \
  --slo-step-strategy geometric \
  --slo-step-turns 5 --slo-step-warmup-turns 1 \
  --system-prefix-tokens 15000 --input-tokens-per-turn 5000 \
  --output-tokens-per-turn 1000 --max-context-tokens 128000 \
  --backend vllm --reset-cache

How it works:

Geometric ramp (1→2→4→8→…) finds the knee region fast; at each N it runs warmup + measure turns per user and checks P{slo_percentile} TTFT/TPOT.
Binary refine between the last-good and first-bad N pinpoints the exact max.
Optional --slo-error-rate caps the error rate; --slo-step-timeout-s aborts a step if the server is overloaded.
--slo-step-reset-cache (default on) isolates each step; turn it off to test sustained pressure.
Output: a capacity curve (N vs P99 TTFT/TPOT/error/SLO-met) and the Max sustained users verdict.
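The ramp-then-refine search can be sketched as follows. This is an illustrative outline only: `find_max_users` and the `meets_slo` callback are hypothetical names, and the sketch assumes that SLO compliance is monotonic in the number of users (once a user count fails, all larger counts fail too). In a real sweep, `meets_slo(n)` would run the warmup and measured turns at `n` users and compare the chosen percentile against the targets.

```python
from typing import Callable


def find_max_users(meets_slo: Callable[[int], bool],
                   min_users: int, max_users: int) -> int:
    """Return the largest N in [min_users, max_users] that meets the SLO, or 0 if none."""
    if not meets_slo(min_users):
        return 0

    last_good, first_bad = min_users, None
    n = min_users
    # Geometric ramp: double N until the SLO is violated or max_users is reached.
    while n < max_users:
        n = min(max(n * 2, n + 1), max_users)
        if meets_slo(n):
            last_good = n
        else:
            first_bad = n
            break

    if first_bad is None:
        return last_good  # the SLO held all the way up to max_users

    # Binary refine: last_good always passes, first_bad always fails.
    while first_bad - last_good > 1:
        mid = (last_good + first_bad) // 2
        if meets_slo(mid):
            last_good = mid
        else:
            first_bad = mid
    return last_good


# Toy predicate standing in for a real measurement: the SLO holds up to 37 users.
print(find_max_users(lambda n: n <= 37, min_users=1, max_users=200))  # prints: 37
```

In this toy run the ramp probes 1, 2, 4, 8, 16, 32, 64 and the refine step then probes 48, 40, 36, 38, 37, so the search needs far fewer steps than a linear scan of 1 to 200.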

CLI Options

User Configuration

Option	Default	Description
`--num-users`	1	Total concurrent users
`--user-arrival`	burst	Arrival pattern: `burst`, `steady:<seconds>`, or `poisson:<lambda>`

Context Configuration

Option	Default	Description
`--system-prefix-tokens`	15000	System prefix token count
`--system-prefix-source`	random	Source: `random` or a file path
`--user-prefix-tokens`	5000	Per-user prefix token count
`--input-tokens-per-turn`	5000	Input tokens per turn
`--output-tokens-per-turn`	1000	Output tokens per turn
`--max-context-tokens`	128000	Context window limit
`--compaction-prefix-increment`	5000	User prefix growth on compaction

Run Configuration

Option	Default	Description
`--max-turns`	100	Maximum turns per user

API Configuration

Option	Default	Description
`--endpoint`	(required)	LLM API endpoint URL
`--model`	(required)	Model name
`--api-key`	(empty)	API key
`--tokenizer`	(defaults to model)	Tokenizer path
`--ignore-eos`	True	Ignore EOS token
`--request-timeout`	600	Request timeout in seconds

System Metrics

Option	Default	Description
`--metrics-endpoint`	None	Prometheus metrics URL. Only start+end snapshots are taken.
`--metrics-interval`	5	Polling interval (s) for periodic time-series; only with `--metrics-samples`
`--metrics-samples`	False	Collect periodic metrics throughout the run (extra `/metrics` calls)
`--reset-cache`	False	Evict the server's prefix cache before the start snapshot (`/reset_prefix_cache` for vLLM, `/flush_cache` for SGLang) so the measured hit rate reflects only this benchmark
`--backend`	vllm	Backend: `vllm`, `sglang`, or `mindie`
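To illustrate how a hit rate can be derived from start and end snapshots, here is a small sketch that parses Prometheus text output and divides the counter deltas. It is not ClawPerf's implementation: the helper names are hypothetical, the `engine` label is illustrative, the query counter name is expanded from the `vllm:prefix_cache_hits_total/queries_total` shorthand used earlier, and the parser assumes sample lines carry no trailing timestamp.

```python
import re

_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*")


def sum_counter(metrics_text: str, name: str) -> float:
    """Sum a counter across all label sets in a Prometheus text snapshot."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _NAME.match(line)
        if match and match.group(0) == name:
            total += float(line.rsplit(None, 1)[-1])  # value is the last field
    return total


def token_hit_rate(start_text: str, end_text: str,
                   hits: str = "vllm:prefix_cache_hits_total",
                   queries: str = "vllm:prefix_cache_queries_total") -> float:
    """Hit rate for the benchmark window: delta(hits) / delta(queries)."""
    hit_delta = sum_counter(end_text, hits) - sum_counter(start_text, hits)
    query_delta = sum_counter(end_text, queries) - sum_counter(start_text, queries)
    return hit_delta / query_delta if query_delta > 0 else 0.0


start = """
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{engine="0"} 100.0
vllm:prefix_cache_queries_total{engine="0"} 400.0
vllm:prefix_cache_hits_total{engine="1"} 0.0
vllm:prefix_cache_queries_total{engine="1"} 0.0
"""
end = """
vllm:prefix_cache_hits_total{engine="0"} 500.0
vllm:prefix_cache_queries_total{engine="0"} 900.0
vllm:prefix_cache_hits_total{engine="1"} 100.0
vllm:prefix_cache_queries_total{engine="1"} 500.0
"""
print(token_hit_rate(start, end))  # prints: 0.5
```

Because the counters are cumulative, traffic that reached the server before the start snapshot drops out of the delta.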

A pre-flight health check (one tiny request) runs before content generation and aborts early if the endpoint is unreachable, so you don't burn minutes producing an all-error run.

Output

Option	Default	Description
`--output`	results.json	Output JSON file path
`--history`	clawperf_history.jsonl	Append a one-line record (config + summary + per-user aggregates) to this JSONL file on every run, accumulating results across runs. Pass an empty string to disable.

Output Format

Results are saved as JSON with:

{
  "config": { ... },
  "summary": {
    "prefix_cache_token_hit_rate": 0.7981,
    "prefix_cache_hit_tokens_delta": 712012,
    "prefix_cache_query_tokens_delta": 892165,
    "total_compactions": 0,
    ...
  },
  "users": [
    {
      "user_id": 0,
      "aggregate": {
        "total_output_tokens": 3000,
        "ttft": { "avg": 150.2, "P50": 140, "P99": 200 },
        "tpot": { "avg": 3.2, "P50": 3.0, "P99": 5.0 },
        "throughput_tok_s": 12.5,
        "error_count": 0,
        "compaction_count": 2
      },
      "turns": [
        {
          "turn_id": 1,
          "success": true,
          "ttft_ms": 150.2,
          "e2e_latency_ms": 3200.5,
          "tpot_ms": 3.2,
          "input_tokens": 25000,
          "output_tokens": 1000,
          "context_tokens": 25000,
          "compaction_triggered": false,
          "wall_start_ts": 0.016,
          "wall_end_ts": 3.354
        }
      ]
    }
  ],
  "system_metrics": [ ... ],
  "timeline": [ ... ],
  "timing": {
    "setup_time_s": 7.437,
    "bench_time_s": 12.281
  }
}

timing.bench_time_s excludes one-time setup (tokenizer download + content generation); per-turn wall_start_ts/wall_end_ts are offsets from the benchmark start and back the per-user duration/throughput aggregates.
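As a rough example of post-processing this file, the sketch below averages TTFT per turn across users, which makes latency growth over a conversation visible. It relies only on the fields shown in the format above; the helper names `load_results` and `ttft_by_turn` and the inline sample values are made up for illustration.

```python
import json
from collections import defaultdict
from statistics import mean


def load_results(path):
    """Load a ClawPerf results file (for example, the --output path)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def ttft_by_turn(results):
    """Mean ttft_ms per turn_id over all users, counting successful turns only."""
    buckets = defaultdict(list)
    for user in results["users"]:
        for turn in user["turns"]:
            if turn["success"]:
                buckets[turn["turn_id"]].append(turn["ttft_ms"])
    return {turn_id: mean(values) for turn_id, values in sorted(buckets.items())}


# Tiny inline sample in the documented shape (values are invented).
sample = {
    "users": [
        {"user_id": 0, "turns": [
            {"turn_id": 1, "success": True, "ttft_ms": 100.0},
            {"turn_id": 2, "success": True, "ttft_ms": 300.0},
        ]},
        {"user_id": 1, "turns": [
            {"turn_id": 1, "success": True, "ttft_ms": 200.0},
            {"turn_id": 2, "success": False, "ttft_ms": 0.0},
        ]},
    ]
}
print(ttft_by_turn(sample))  # prints: {1: 150.0, 2: 300.0}
```

For a real run, replace `sample` with `load_results("results.json")`.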

Result History

Every run appends one compact JSON line to clawperf_history.jsonl (configurable via --history, disable with --history ""). Each line carries the run timestamp, the full config, the summary, timing, and per-user aggregates — but not the heavy per-turn arrays, so the file stays queryable as runs accumulate.

Collect and compare results across runs with standard tooling:

# Latest run's hit rate
tail -n1 clawperf_history.jsonl | jq '.summary.prefix_cache_token_hit_rate'

# Throughput trend over all runs
jq -c '{users: .config.num_users, bench_s: .timing.bench_time_s,
        hit_rate: .summary.prefix_cache_token_hit_rate}' clawperf_history.jsonl

The full per-turn detail for any run is still in its --output JSON file, referenced from each history record's output_file field.

Testing Philosophy

ClawPerfBench is designed to simulate the real workload of an Agent system — not single-shot API calls, but sustained multi-turn conversations that push LLM serving backends to their limits.

Why multi-turn matters

Real Agent systems (like OpenClaw) don't send one-off requests. They maintain long conversations: a system prompt, user-specific context, and growing history. Each turn re-sends the entire accumulated context, so the prompt grows with every turn (roughly linearly in the number of turns, which makes the total tokens processed over a conversation grow quadratically unless cached prefixes are reused). This is fundamentally different from single-request benchmarks and exposes backend behaviors that single-shot tests miss:

Prefix cache effectiveness: Does the KV-block cache actually reuse tokens across turns? A single-request benchmark can't measure this.
Compaction under load: When context hits the window limit, how does the system handle truncation? Does it recover gracefully or spiral into overflow?
Latency degradation: As context grows from 25K to 200K tokens, TTFT and TPOT change dramatically. Per-turn metrics reveal this progression.
Concurrent pressure: Multiple users with independent conversations create mixed prefix cache states — some sharing the system prefix, others diverging at user-specific paths.
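To put numbers on how context grows, the sketch below tabulates the prompt size per turn using the default token settings from the CLI options. It assumes that every completed turn adds its input and output tokens to the history, which is a simplification for illustration; `prompt_tokens_at_turn` is a hypothetical helper, not part of ClawPerf.

```python
def prompt_tokens_at_turn(turn, system_prefix=15000, user_prefix=5000,
                          input_per_turn=5000, output_per_turn=1000):
    """Prompt size at a 1-based turn: prefixes + accumulated history + current input."""
    history = (turn - 1) * (input_per_turn + output_per_turn)
    return system_prefix + user_prefix + history + input_per_turn


MAX_CONTEXT = 128000  # default --max-context-tokens

print(prompt_tokens_at_turn(1))   # prints: 25000
print(prompt_tokens_at_turn(18))  # prints: 127000

# First turn whose prompt would exceed the context window.
first_over = next(t for t in range(1, 1000)
                  if prompt_tokens_at_turn(t) > MAX_CONTEXT)
print(first_over)  # prints: 19

# Tokens the server must process over turns 1..18 if nothing is cached.
print(sum(prompt_tokens_at_turn(t) for t in range(1, 19)))  # prints: 1368000
```

Per-turn prompt size grows by a constant 6000 tokens, but the cumulative prefill work without caching is about 1.37M tokens for 18 turns, which is where prefix cache reuse pays off.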

Simulating real users

Each simulated user maintains an independent conversation state with its own growing prefix and history. Users arrive according to configurable patterns (burst, steady, Poisson) — mimicking how real traffic builds up, not an artificial flood of identical requests.

What we measure

What	Why it matters
TTFT per turn	First-token latency grows with context size — the key UX metric for Agent systems
TPOT per turn	Generation speed should stay stable; degradation indicates compute bottlenecks
Prefix cache hit rate	Token-level reuse fraction across turns — the efficiency metric for KV caching
Compaction events	When and how often context overflows — determines conversation continuity
Per-user breakdown	Different users have different prefix paths; aggregate stats hide per-user variance

Context Model

Each user's context follows this structure:

[Figure: context model and compaction]

When context reaches --max-context-tokens, append-mode compaction fires:

The base context (system + user prefix + input, without history) is checked first. If it already exceeds the limit, compaction is skipped and the turn is marked as context_overflow — this prevents infinite compaction loops.
Otherwise, history is cleared and the user prefix grows by --compaction-prefix-increment tokens.
New random content fills the enlarged user prefix.
If the grown base still exceeds the limit, the prefix growth is reverted (history cleared only) so the user isn't permanently trapped in overflow.

This simulates how real LLM serving systems handle context overflow with prefix caching.
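The compaction rules above can be expressed as a small state machine over token counts. This is a sketch under stated assumptions rather than the actual `UserContext` class: `UserContextSketch`, `prepare_turn`, and `record_turn` are hypothetical names, only token counts are tracked (no real content), and "reaches the limit" is interpreted as "would exceed `max_context`".

```python
from dataclasses import dataclass


@dataclass
class UserContextSketch:
    system_prefix: int = 15000
    user_prefix: int = 5000
    history: int = 0
    max_context: int = 128000
    prefix_increment: int = 5000  # --compaction-prefix-increment

    def base(self, input_tokens: int) -> int:
        """Context without history: system prefix + user prefix + current input."""
        return self.system_prefix + self.user_prefix + input_tokens

    def prepare_turn(self, input_tokens: int) -> str:
        """Return 'ok', 'compacted', or 'context_overflow' for the upcoming turn."""
        if self.base(input_tokens) + self.history <= self.max_context:
            return "ok"
        # Check the base first: if it alone is too large, compaction cannot help.
        if self.base(input_tokens) > self.max_context:
            return "context_overflow"
        # Append-mode compaction: clear history, grow the user prefix.
        self.history = 0
        self.user_prefix += self.prefix_increment
        if self.base(input_tokens) > self.max_context:
            # Revert the growth so the user is not permanently stuck in overflow.
            self.user_prefix -= self.prefix_increment
        return "compacted"

    def record_turn(self, input_tokens: int, output_tokens: int) -> None:
        """After a completed turn, its input and output become history."""
        self.history += input_tokens + output_tokens


ctx = UserContextSketch()
statuses = []
for _ in range(19):
    statuses.append(ctx.prepare_turn(5000))
    ctx.record_turn(5000, 1000)
print(statuses[17], statuses[18], ctx.user_prefix)  # prints: ok compacted 10000
```

With the default settings, the first 18 turns fit in the window and turn 19 triggers the first compaction, after which the user prefix is 10000 tokens.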

Prefix Cache Simulation

The mock server simulates vLLM's KV-block prefix cache using a trie:

HBM trie: Represents GPU KV cache. Queried first for longest prefix match. Always updated after every request (mimicking vLLM storing all KV blocks regardless of hit/miss).
External trie: Represents CPU/disk prefix cache. Queried on HBM miss. Also always updated after every request.
Token-level hit rate: prefix_cache_hit_tokens / prefix_cache_query_tokens — the fraction of prompt tokens that reuse cached KV blocks. This is the meaningful metric; request-level (binary) hit rate is not reported.
Eviction: When the trie exceeds max_prefixes (200), oldest leaf nodes are evicted.
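A stripped-down version of the two-tier lookup might look like the following. It is an illustration, not the mock server's code: `PrefixTrie` and `TwoTierCache` are hypothetical names, the trie works on individual tokens rather than KV blocks, an "HBM miss" is interpreted as a zero-length match, and eviction is left out to keep the example short.

```python
class PrefixTrie:
    """Token-level trie stored as nested dicts."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Length of the longest stored prefix of `tokens`."""
        node, matched = self.root, 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for token in tokens:
            node = node.setdefault(token, {})


class TwoTierCache:
    """HBM trie queried first, external trie queried on an HBM miss."""

    def __init__(self):
        self.hbm = PrefixTrie()
        self.external = PrefixTrie()
        self.hit_tokens = 0
        self.query_tokens = 0

    def lookup_and_store(self, tokens):
        hit = self.hbm.match_len(tokens)
        if hit == 0:  # assumed meaning of "HBM miss"
            hit = self.external.match_len(tokens)
        # Both tiers are updated after every request, hit or miss.
        self.hbm.insert(tokens)
        self.external.insert(tokens)
        self.hit_tokens += hit
        self.query_tokens += len(tokens)
        return hit

    def token_hit_rate(self):
        return self.hit_tokens / self.query_tokens if self.query_tokens else 0.0


cache = TwoTierCache()
print(cache.lookup_and_store([1, 2, 3, 4]))     # prints: 0
print(cache.lookup_and_store([1, 2, 3, 9, 9]))  # prints: 3
print(cache.hit_tokens, cache.query_tokens)     # prints: 3 9
```

Without eviction the two tries hold identical content, so the external tier only matters once the HBM trie starts dropping entries.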

User Arrival Scheduling

[Figure: user arrival patterns]

burst: All users start immediately
steady:2: One new user arrives every 2 seconds
poisson:0.5: Users arrive following a Poisson process with rate 0.5
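The three patterns can be approximated by computing a start offset for each user. The sketch below is an assumption-laden illustration, not the contents of `scheduler.py`: `arrival_offsets` is a hypothetical function, the first user is assumed to start at time 0, and the Poisson rate is assumed to be in users per second, so that gaps between arrivals are exponentially distributed with mean 1/rate.

```python
import random


def arrival_offsets(spec, num_users, seed=None):
    """Return each user's start offset in seconds for a --user-arrival style spec."""
    rng = random.Random(seed)
    mode, _, arg = spec.partition(":")

    if mode == "burst":
        return [0.0] * num_users  # everyone starts immediately

    if mode == "steady":
        interval = float(arg)
        return [i * interval for i in range(num_users)]

    if mode == "poisson":
        rate = float(arg)
        offsets, t = [], 0.0
        for _ in range(num_users):
            offsets.append(t)
            t += rng.expovariate(rate)  # exponential gap between arrivals
        return offsets

    raise ValueError(f"unknown arrival pattern: {spec!r}")


print(arrival_offsets("burst", 3))     # prints: [0.0, 0.0, 0.0]
print(arrival_offsets("steady:2", 3))  # prints: [0.0, 2.0, 4.0]
print(len(arrival_offsets("poisson:0.5", 5, seed=1)))  # prints: 5
```

An async runner could then sleep until each offset before launching the corresponding user's conversation loop.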

Architecture

ClawPerf reuses EvalScope's core perf components:

AioHttpClient: Async HTTP with streaming, proper timeout/connector config
OpenaiPlugin: Request building, response parsing, local token counting
BenchmarkData: Single-request data container (TTFT, ITL, E2E timing)
MetricsAccumulator: Real-time metrics aggregation

And adds its own orchestration layer for multi-turn, multi-user workloads.

Key modules:

Module	Role
`cli.py`	Argparse entry point, config creation, runner launch
`config.py`	`BenchmarkConfig` dataclass, arrival mode parsing
`runner.py`	`BenchmarkRunner` orchestrator, user loop, result finalization, JSONL history
`context.py`	`UserContext` context assembly, compaction with infinite-loop guard
`scheduler.py`	Burst/steady/Poisson async generators
`system_metrics.py`	`SystemMetricsPoller` with backend-specific metric mappings
`tokenizer.py`	`TokenizerManager` wrapping ModelScope/HuggingFace tokenizers
`logging_setup.py`	Centralized logging routed through `tqdm.write`
`mock_server.py`	FastAPI mock LLM server with trie-based prefix cache simulation

Development

uv sync --extra dev --extra mock-server
pytest
ruff check

License

Apache License 2.0

Release history

Version	Released
0.4.0 (this version)	Jul 2, 2026
0.3.1	Jun 27, 2026
0.3.0	Jun 27, 2026
0.2.9	Jun 27, 2026
0.2.8	Jun 25, 2026
0.2.7	Jun 25, 2026
0.2.6	Jun 25, 2026
0.2.5	Jun 25, 2026
0.2.4	Jun 25, 2026
0.2.3	Jun 25, 2026
0.2.2	Jun 25, 2026
0.2.1	Jun 25, 2026
0.2.0	Jun 24, 2026
0.1.5	May 30, 2026
0.1.4	May 30, 2026
0.1.3	May 30, 2026
0.1.2	May 30, 2026
0.1.0	May 28, 2026
