ClawPerfBench

Performance benchmarking tool for LLM Serving backends with multi-turn long-context workloads.

Built on EvalScope's perf infrastructure, adding:

  • Multi-turn context model: System Prefix + User Prefix + History + Current Input
  • Append-mode compaction: When the context reaches its limit, clear the history and grow the user prefix
  • User arrival scheduling: Burst, steady, or Poisson arrival patterns
  • System metrics polling: Prometheus endpoint support for vLLM, SGLang, MindIE
  • Per-user + per-turn metrics: TTFT, TPOT, ITL with compaction tracking
  • Prefix cache simulation: Trie-based HBM + external prefix cache hit rate tracking in mock server

[Image: ClawPerf benchmark output]

Installation

pip install clawperf

For the mock server used in testing:

pip install "clawperf[mock-server]"

For development:

pip install "clawperf[dev]"

Install from source (recommended for development):

git clone https://github.com/Potterluo/ClawPerf.git
cd ClawPerf
uv sync --extra dev --extra mock-server

Quick Start

Run a benchmark

clawperf \
  --endpoint http://localhost:8000/v1/chat/completions \
  --model qwen3-32b \
  --num-users 5 \
  --user-arrival steady:2 \
  --max-turns 10 \
  --output results.json

Start mock server (for testing)

clawperf-mock-server --port 8080

End-to-end test with mock server

# Start mock server (it runs in the foreground: use a separate terminal, or append & to background it)
clawperf-mock-server --port 8080

# Run benchmark against it
clawperf \
  --endpoint http://localhost:8080/v1/chat/completions \
  --model Qwen/Qwen2.5-7B-Instruct \
  --tokenizer Qwen/Qwen2.5-7B-Instruct \
  --num-users 4 \
  --max-turns 5 \
  --max-context-tokens 200000 \
  --metrics-endpoint http://localhost:8080/metrics \
  --backend vllm \
  --verbose

CLI Options

User Configuration

Option          Default  Description
--num-users     1        Total concurrent users
--user-arrival  burst    Arrival pattern: burst, steady:<seconds>, or poisson:<lambda>

Context Configuration

Option                         Default  Description
--system-prefix-tokens         15000    System prefix token count
--system-prefix-source         random   Source: random or a file path
--user-prefix-tokens           5000     Per-user prefix token count
--input-tokens-per-turn        5000     Input tokens per turn
--output-tokens-per-turn       1000     Output tokens per turn
--max-context-tokens           128000   Context window limit
--compaction-prefix-increment  5000     User prefix growth on compaction

Run Configuration

Option       Default  Description
--max-turns  100      Maximum turns per user

API Configuration

Option             Default              Description
--endpoint         (required)           LLM API endpoint URL
--model            (required)           Model name
--api-key          (empty)              API key
--tokenizer        (defaults to model)  Tokenizer path
--ignore-eos       True                 Ignore EOS token
--request-timeout  600                  Request timeout in seconds

System Metrics

Option              Default  Description
--metrics-endpoint  None     Prometheus metrics URL
--metrics-interval  5        Polling interval in seconds
--backend           vllm     Backend: vllm, sglang, or mindie
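
As a rough illustration of what polling a Prometheus endpoint involves, the sketch below parses two scrapes of Prometheus text-format output and computes counter deltas between them, similar in spirit to the `*_delta` fields in the results summary. The metric names (`demo_...`) and helper functions are hypothetical and chosen only for this example; the real names differ per backend, and this is not ClawPerf's own poller.

```python
def parse_counters(text: str) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text-format output.

    Simplification: assumes each sample line ends with its value
    (no trailing timestamp) and ignores labels.
    """
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0].strip()
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # not a numeric sample; ignore it
    return totals


def counter_delta(before: dict[str, float], after: dict[str, float], name: str) -> float:
    """Difference of a counter between two scrapes."""
    return after.get(name, 0.0) - before.get(name, 0.0)


# Hypothetical scrapes taken at the start and end of a benchmark run.
SCRAPE_START = """\
# TYPE demo_prefix_cache_hit_tokens_total counter
demo_prefix_cache_hit_tokens_total{model="m"} 1000
demo_prefix_cache_query_tokens_total{model="m"} 4000
"""
SCRAPE_END = """\
# TYPE demo_prefix_cache_hit_tokens_total counter
demo_prefix_cache_hit_tokens_total{model="m"} 4000
demo_prefix_cache_query_tokens_total{model="m"} 8000
"""

start, end = parse_counters(SCRAPE_START), parse_counters(SCRAPE_END)
hits = counter_delta(start, end, "demo_prefix_cache_hit_tokens_total")
queries = counter_delta(start, end, "demo_prefix_cache_query_tokens_total")
print(f"token hit rate over the run: {hits / queries:.2f}")
```

Using deltas rather than raw counter values keeps traffic that hit the server before the benchmark started out of the reported hit rate.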

Output

Option    Default       Description
--output  results.json  Output JSON file path

Output Format

Results are saved as JSON with the following structure (values are illustrative; ... marks elided fields):

{
  "config": { ... },
  "summary": {
    "prefix_cache_token_hit_rate": 0.7981,
    "prefix_cache_hit_tokens_delta": 712012,
    "prefix_cache_query_tokens_delta": 892165,
    "total_compactions": 0,
    ...
  },
  "users": [
    {
      "user_id": 0,
      "aggregate": {
        "total_output_tokens": 3000,
        "ttft": { "avg": 150.2, "P50": 140, "P99": 200 },
        "tpot": { "avg": 3.2, "P50": 3.0, "P99": 5.0 },
        "throughput_tok_s": 12.5,
        "error_count": 0,
        "compaction_count": 2
      },
      "turns": [
        {
          "turn_id": 1,
          "success": true,
          "ttft_ms": 150.2,
          "e2e_latency_ms": 3200.5,
          "tpot_ms": 3.2,
          "input_tokens": 25000,
          "output_tokens": 1000,
          "context_tokens": 25000,
          "compaction_triggered": false
        }
      ]
    }
  ],
  "system_metrics": [ ... ],
  "timeline": [ ... ]
}
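
A results file with this shape can be post-processed with a few lines of Python. The sketch below is a minimal example that relies only on the keys shown above; the helper names (`load_results`, `summarize_users`, `token_hit_rate`) are made up for illustration and are not part of the clawperf package.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read a results file written via --output."""
    return json.loads(Path(path).read_text())


def summarize_users(results: dict) -> list[dict]:
    """Return one row per user with a few headline numbers."""
    rows = []
    for user in results.get("users", []):
        agg = user["aggregate"]
        turns = user.get("turns", [])
        rows.append({
            "user_id": user["user_id"],
            "ttft_p99": agg["ttft"]["P99"],
            "tpot_avg": agg["tpot"]["avg"],
            "compactions": agg["compaction_count"],
            "failed_turns": sum(1 for t in turns if not t["success"]),
        })
    return rows


def token_hit_rate(summary: dict) -> float:
    """Recompute the token-level hit rate from the two delta counters."""
    queried = summary["prefix_cache_query_tokens_delta"]
    return summary["prefix_cache_hit_tokens_delta"] / queried if queried else 0.0


# Inline sample using the values from the example above.
sample = {
    "summary": {
        "prefix_cache_hit_tokens_delta": 712012,
        "prefix_cache_query_tokens_delta": 892165,
    },
    "users": [{
        "user_id": 0,
        "aggregate": {
            "ttft": {"avg": 150.2, "P50": 140, "P99": 200},
            "tpot": {"avg": 3.2, "P50": 3.0, "P99": 5.0},
            "compaction_count": 2,
        },
        "turns": [{"turn_id": 1, "success": True}],
    }],
}

print(summarize_users(sample))
print(round(token_hit_rate(sample["summary"]), 4))
```

For a real run, replace `sample` with `load_results("results.json")`.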

Context Model

Each user's context follows this structure:

[System Prefix] [User Prefix] [History] [Current Input]

When context reaches --max-context-tokens, append-mode compaction fires:

  1. The base context (system + user prefix + input, without history) is checked first. If it already exceeds the limit, compaction is skipped and the turn is marked as context_overflow. This guard prevents infinite compaction loops.
  2. Otherwise, history is cleared and the user prefix grows by --compaction-prefix-increment tokens.
  3. New random content fills the enlarged user prefix.

This simulates how real LLM serving systems handle context overflow with prefix caching.
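
The compaction steps above can be sketched in token-count terms as follows. This is a simplified model, not the code in context.py: the `ContextState` class and its method names are hypothetical, the default values mirror the CLI defaults listed earlier, and it assumes that history grows by the input plus output tokens of each completed turn and that compaction triggers once the total is greater than or equal to the limit.

```python
from dataclasses import dataclass


@dataclass
class ContextState:
    # Defaults mirror the CLI defaults; the class itself is illustrative.
    system_prefix: int = 15000
    user_prefix: int = 5000
    history: int = 0
    max_context: int = 128000
    prefix_increment: int = 5000
    compactions: int = 0

    def base(self, input_tokens: int) -> int:
        """Context size without history."""
        return self.system_prefix + self.user_prefix + input_tokens

    def prepare_turn(self, input_tokens: int) -> str:
        """Return 'ok', 'compacted', or 'context_overflow' for the next turn."""
        if self.base(input_tokens) + self.history < self.max_context:
            return "ok"
        # Guard: if the base alone exceeds the limit, compacting cannot help.
        if self.base(input_tokens) > self.max_context:
            return "context_overflow"
        # Append-mode compaction: drop history, enlarge the user prefix.
        self.history = 0
        self.user_prefix += self.prefix_increment
        self.compactions += 1
        return "compacted"

    def finish_turn(self, input_tokens: int, output_tokens: int) -> None:
        """Assumption: both the input and the output join the history."""
        self.history += input_tokens + output_tokens


def first_compaction_turn(state, input_tokens=5000, output_tokens=1000, max_turns=100):
    """Return the 1-based turn at which compaction first fires, or None."""
    for turn in range(1, max_turns + 1):
        status = state.prepare_turn(input_tokens)
        if status != "ok":
            return turn if status == "compacted" else None
        state.finish_turn(input_tokens, output_tokens)
    return None


print(first_compaction_turn(ContextState()))  # 19 under the assumptions above
```

Under these assumptions the first turn carries 15000 + 5000 + 5000 = 25000 context tokens, each completed turn adds 6000 tokens of history, and the 128000 limit is first reached on turn 19. The sketch does not re-check the enlarged base after compaction.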

Prefix Cache Simulation

The mock server simulates vLLM's KV-block prefix cache using a trie:

  • HBM trie: Represents GPU KV cache. Queried first for longest prefix match. Always updated after every request (mimicking vLLM storing all KV blocks regardless of hit/miss).
  • External trie: Represents CPU/disk prefix cache. Queried on HBM miss. Also always updated after every request.
  • Token-level hit rate: prefix_cache_hit_tokens / prefix_cache_query_tokens, the fraction of prompt tokens that reuse cached KV blocks. This is the meaningful metric; request-level (binary) hit rate is not reported.
  • Eviction: When the trie exceeds max_prefixes (200), oldest leaf nodes are evicted.
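
To make the longest-prefix-match idea concrete, here is a minimal single-tier sketch of a token trie with token-level hit-rate accounting. It is not the mock server's implementation: the `PrefixTrie` class is hypothetical, it works per token rather than per KV block, and it leaves out the external tier and eviction.

```python
class PrefixTrie:
    """Token-level prefix trie: each node is a dict keyed by token id."""

    def __init__(self) -> None:
        self.root: dict = {}
        self.hit_tokens = 0
        self.query_tokens = 0

    def longest_prefix(self, tokens: list[int]) -> int:
        """Number of leading tokens already present in the trie."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def handle_request(self, tokens: list[int]) -> int:
        """Look up the prompt, update counters, then always store it."""
        matched = self.longest_prefix(tokens)
        self.hit_tokens += matched
        self.query_tokens += len(tokens)
        self.insert(tokens)  # stored regardless of hit or miss
        return matched

    @property
    def token_hit_rate(self) -> float:
        return self.hit_tokens / self.query_tokens if self.query_tokens else 0.0


cache = PrefixTrie()
system_prefix = [1, 2, 3, 4]                         # shared by both requests
first = cache.handle_request(system_prefix + [10, 11])   # cold cache, no reuse
second = cache.handle_request(system_prefix + [20, 21])  # reuses the shared prefix
print(first, second, round(cache.token_hit_rate, 4))
```

The second request reuses the four shared prefix tokens, so 4 of the 12 queried tokens count as hits.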

User Arrival Scheduling

  • burst: All users start immediately
  • steady:2: One new user arrives every 2 seconds
  • poisson:0.5: Users arrive following a Poisson process with rate (lambda) 0.5
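
The three patterns can be illustrated by computing each user's start offset. The function below is a sketch with a hypothetical name and signature, not the async generators in scheduler.py. It assumes that the first user starts at time 0, that the steady argument is an interval in seconds, and that the Poisson rate is in users per second, so inter-arrival gaps are exponentially distributed.

```python
import random


def arrival_offsets(pattern: str, num_users: int, seed: int = 0) -> list[float]:
    """Return the start offset in seconds for each user."""
    mode, _, arg = pattern.partition(":")
    if mode == "burst":
        return [0.0] * num_users
    if mode == "steady":
        interval = float(arg)
        return [i * interval for i in range(num_users)]
    if mode == "poisson":
        rate = float(arg)
        rng = random.Random(seed)  # seeded for reproducible schedules
        offsets, t = [], 0.0
        for _ in range(num_users):
            offsets.append(t)
            # Gaps in a Poisson process are exponential with mean 1/rate.
            t += rng.expovariate(rate)
        return offsets
    raise ValueError(f"unknown arrival pattern: {pattern!r}")


print(arrival_offsets("burst", 3))
print(arrival_offsets("steady:2", 3))
print(arrival_offsets("poisson:0.5", 3))
```

With a rate of 0.5, the mean gap between consecutive users is 2 seconds, so poisson:0.5 and steady:2 produce the same average arrival rate with different burstiness.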

Architecture

ClawPerf reuses EvalScope's core perf components:

  • AioHttpClient: Async HTTP with streaming, proper timeout/connector config
  • OpenaiPlugin: Request building, response parsing, local token counting
  • BenchmarkData: Single-request data container (TTFT, ITL, E2E timing)
  • MetricsAccumulator: Real-time metrics aggregation

And adds its own orchestration layer for multi-turn, multi-user workloads.

Key modules:

Module             Role
cli.py             Argparse entry point, config creation, runner launch
config.py          BenchmarkConfig dataclass, arrival mode parsing
runner.py          BenchmarkRunner orchestrator, user loop, result finalization
context.py         UserContext context assembly, compaction with infinite-loop guard
scheduler.py       Burst/steady/Poisson async generators
system_metrics.py  SystemMetricsPoller with backend-specific metric mappings
tokenizer.py       TokenizerManager wrapping ModelScope/HuggingFace tokenizers
mock_server.py     FastAPI mock LLM server with trie-based prefix cache simulation

Development

uv sync --extra dev --extra mock-server
pytest
ruff check

License

Apache License 2.0
