Skip to main content

Production-quality performance validation and regression testing for LLM inference systems

Project description

llmtest-perf

Production-quality performance validation and regression testing for LLM inference systems.

GitHub stars PyPI License: MIT Python 3.9+

Install: pip install llmtest-perf PyPI: https://pypi.org/project/llmtest-perf/


What is llmtest-perf?

llmtest-perf is a pytest-like performance validation framework for LLM inference systems. It helps engineering teams answer critical questions before deploying model or infrastructure changes:

  • Did latency regress after upgrading to a new model version?
  • Did throughput improve or degrade with the new runtime configuration?
  • What happened to TTFT (time to first token) and token generation speed?
  • How does the system behave under realistic mixed workloads?
  • Is this deployment safe to promote based on our SLOs?

This is not a generic benchmark tool. It's a release-gating and regression-testing framework designed for CI/CD pipelines and production deployment validation.


Key Features

  • Workload-aware testing - Define realistic mixed workloads with weighted prompt sets
  • CI-friendly - Pass/fail based on SLOs and regression thresholds
  • Comparison-first - Built-in baseline vs candidate comparison mode
  • Developer-friendly - Declarative YAML configs, rich console output
  • Practical metrics - P50/P90/P95/P99 latency, TTFT, throughput, error rates
  • Async engine - High-performance httpx-based async workload runner
  • Multiple outputs - Console, JSON, and self-contained HTML reports
  • Extensible - Clean provider abstraction (OpenAI-compatible first)

Who is this for?

llmtest-perf is built for:

  • ML Engineers validating model performance before deployment
  • Infrastructure Engineers testing LLM serving optimizations
  • Platform Teams implementing SLO-based release gates
  • DevOps/SRE teams running performance regression tests in CI

Installation

# Install from source
cd llmtest-perf
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

Requirements:

  • Python 3.11+
  • API access to an OpenAI-compatible endpoint

Quick Start

1. Initialize a demo config

llmtest-perf init demo

This creates demo.yaml with example configuration.

2. Edit the config

Update the endpoints and API keys:

targets:
  baseline:
    base_url: "http://localhost:8000/v1"
    model: "your-model"
    api_key_env: "OPENAI_API_KEY"

  candidate:
    base_url: "http://localhost:8001/v1"
    model: "your-model-v2"
    api_key_env: "OPENAI_API_KEY"

3. Run a single target

export OPENAI_API_KEY="your-key"
llmtest-perf run demo.yaml --target baseline

4. Run baseline vs candidate comparison

llmtest-perf compare demo.yaml

Configuration

Full Configuration Example

provider: openai_compatible

targets:
  baseline:
    base_url: "http://localhost:8000/v1"
    model: "gpt-3.5-turbo"
    api_key_env: "OPENAI_API_KEY"

  candidate:
    base_url: "http://localhost:8001/v1"
    model: "gpt-4-turbo"
    api_key_env: "OPENAI_API_KEY"

workload:
  duration_seconds: 60
  max_concurrency: 32
  ramp_up_seconds: 10
  stream: true

  prompt_sets:
    - name: short_qa
      weight: 40
      prompts:
        - "What is the capital of France?"
        - "Explain TCP vs UDP briefly."

    - name: long_context
      weight: 30
      prompts:
        - "Summarize this architecture document: ..."

    - name: structured_output
      weight: 30
      prompts:
        - "Return JSON with keys: summary, sentiment for this text."

request:
  max_tokens: 256
  temperature: 0.0
  timeout_seconds: 60

slos:
  p95_latency_ms: 2500
  ttft_ms: 1200
  output_tokens_per_sec: 40
  error_rate_percent: 1.0

comparison:
  fail_on_regression: true
  max_p95_latency_regression_percent: 10
  max_ttft_regression_percent: 10
  max_output_tokens_per_sec_drop_percent: 10
  max_error_rate_increase_percent: 1

reporting:
  json: "artifacts/results.json"
  html: "artifacts/report.html"
  console: true

seed: 42  # For reproducible workloads

Key Configuration Sections

Targets - Define baseline and/or candidate deployments

  • base_url: OpenAI-compatible endpoint
  • model: Model identifier
  • api_key_env: Environment variable for API key

Workload - Configure test execution

  • duration_seconds: How long to run the test
  • max_concurrency: Maximum concurrent requests
  • ramp_up_seconds: Gradual ramp-up period
  • stream: Use streaming responses (captures TTFT)
  • prompt_sets: Weighted collections of prompts

Request - LLM parameters

  • max_tokens, temperature, timeout_seconds, top_p

SLOs - Absolute thresholds (optional)

  • p95_latency_ms, ttft_ms, output_tokens_per_sec, error_rate_percent

Comparison - Regression detection (optional)

  • fail_on_regression: Exit with error on regression
  • max_*_regression_percent: Allowed regression thresholds

Reporting - Output configuration

  • json, html, console

CLI Commands

run - Run performance test

# Run all targets
llmtest-perf run config.yaml

# Run specific target
llmtest-perf run config.yaml --target baseline

# Suppress console output
llmtest-perf run config.yaml --quiet

Outputs:

  • Console summary (if enabled)
  • JSON results (if configured)
  • HTML report (if configured)
  • Exit code 0 on success, 1 on SLO violation

compare - Baseline vs candidate comparison

llmtest-perf compare config.yaml

Requires: Both baseline and candidate targets in config

Outputs:

  • Side-by-side comparison table
  • Regression/improvement detection
  • Final verdict (PASS/FAIL)
  • Exit code 0 if no regressions, 1 if regressions detected

validate - Validate configuration

llmtest-perf validate config.yaml

Checks YAML syntax and schema validation without running tests.

init - Initialize demo config

llmtest-perf init demo

Creates a demo configuration file to get started.


Sample Output

Console Output (Run)

Results for: baseline

┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Metric           ┃ Value  ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total Requests   │ 1,847  │
│ Successful       │ 1,842  │
│ Failed           │ 5      │
│ Error Rate       │ 0.27%  │
│ Duration         │ 60.12s │
│ Throughput       │ 30.7/s │
│ Token Throughput │ 45.2/s │
└──────────────────┴────────┘

┏━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Percentile ┃ Value   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━┩
│ P50        │ 1,842ms │
│ P90        │ 2,104ms │
│ P95        │ 2,287ms │
│ P99        │ 2,891ms │
└────────────┴─────────┘

Console Output (Compare)

Comparison: baseline vs candidate

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Metric             ┃ Baseline ┃ Candidate ┃ Delta  ┃ Status      ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━┩
│ P95 Latency (ms)   │ 2287.45  │ 2612.38   │ +14.2% │ REGRESSION  │
│ TTFT (ms)          │ 1142.22  │ 1046.11   │ -8.4%  │ IMPROVEMENT │
│ Output Tokens/Sec  │ 45.23    │ 39.76     │ -12.1% │ REGRESSION  │
│ Error Rate (%)     │ 0.27     │ 0.32      │ +18.5% │ OK          │
└────────────────────┴──────────┴───────────┴────────┴─────────────┘

╭─────────────────────────────────────────╮
│              FAIL                       │
│                                         │
│ Performance regression detected (2      │
│ metrics)                                │
│                                         │
│ Recommendation: DO NOT PROMOTE - Fix    │
│ regressions before deploying            │
╰─────────────────────────────────────────╯

Use Cases

1. Model Version Validation

Compare inference performance between model versions:

targets:
  baseline:
    base_url: "https://api.example.com/v1"
    model: "gpt-3.5-turbo"

  candidate:
    base_url: "https://api.example.com/v1"
    model: "gpt-4-turbo"

Run: llmtest-perf compare config.yaml

2. Infrastructure Change Validation

Test serving optimizations or hardware changes:

targets:
  baseline:
    base_url: "http://old-cluster.internal/v1"
    model: "llama-2-70b"

  candidate:
    base_url: "http://new-cluster.internal/v1"
    model: "llama-2-70b"

3. SLO Compliance Monitoring

Gate deployments on SLO compliance:

slos:
  p95_latency_ms: 2000
  ttft_ms: 800
  error_rate_percent: 0.5

comparison:
  fail_on_regression: true

Integrate into CI: fails if SLOs violated or regressions detected.

4. Realistic Workload Testing

Model production traffic patterns:

prompt_sets:
  - name: quick_questions
    weight: 60
    prompts: [...]

  - name: complex_analysis
    weight: 30
    prompts: [...]

  - name: code_generation
    weight: 10
    prompts: [...]

CI/CD Integration

GitHub Actions Example

name: LLM Performance Tests

on:
  pull_request:
    paths:
      - 'models/**'
      - 'infrastructure/**'

jobs:
  perf-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install llmtest-perf
        run: |
          pip install -e .

      - name: Run performance comparison
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          llmtest-perf compare .github/perf-config.yaml

      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: perf-reports
          path: artifacts/

Architecture

Components

llmtest-perf/
├── config/          # YAML config loading and validation (Pydantic)
├── providers/       # LLM provider abstraction (OpenAI-compatible)
├── engine/          # Async workload runner, metrics, scheduling
├── compare/         # Baseline vs candidate comparison logic
├── reporting/       # Console, JSON, HTML report generation
└── cli.py           # Typer-based CLI

Workload Engine

  • Async I/O: httpx-based for high concurrency
  • Concurrency control: Semaphore-based with configurable limits
  • Ramp-up: Linear ramp-up to avoid cold-start bias
  • Streaming: Captures TTFT when streaming is enabled
  • Reproducibility: Optional random seed for deterministic workloads

Metrics Collection

Per-request:

  • End-to-end latency
  • Time to first token (TTFT) for streaming
  • Input/output token counts
  • Error type and message

Aggregated:

  • P50/P90/P95/P99 percentiles
  • Mean, min, max
  • Request throughput (req/sec)
  • Token throughput (tok/sec)
  • Error rate percentage
  • Per-prompt-set breakdown

Limitations

Current Version

  • OpenAI-compatible only: Currently supports OpenAI Chat Completions API format
  • HTTP-based: No gRPC or other protocols yet
  • Single-region: No multi-region testing
  • Token counting: Relies on provider-reported token counts
  • No payload size limits: Large prompts may need external file support

Planned Improvements

See Roadmap section below.


Roadmap

Future enhancements under consideration:

  • Additional providers: Anthropic, Google, AWS Bedrock, Azure OpenAI
  • Advanced workloads: Prompt templates, external prompt files, payload generation
  • Cost tracking: Track API costs alongside performance
  • Historical trending: Track metrics over time
  • Warmup phase: Pre-test warmup to eliminate cold starts
  • Custom metrics: Plugin system for custom metric collection
  • Distributed testing: Multi-node workload generation
  • Real-time monitoring: Live dashboard during test runs

Development

Setup

# Clone repo
git clone <repo-url>
cd llmtest-perf

# Install in development mode
pip install -e ".[dev]"

Run Tests

# Run all tests
pytest

# With coverage
pytest --cov=llmtest_perf --cov-report=html

# Type checking
mypy src/llmtest_perf

# Linting
ruff check src/

Project Structure

llmtest-perf/
├── src/llmtest_perf/      # Main package
│   ├── config/            # Configuration models
│   ├── engine/            # Workload execution
│   ├── providers/         # Provider implementations
│   ├── compare/           # Comparison logic
│   ├── reporting/         # Report generation
│   └── cli.py             # CLI interface
├── tests/                 # Test suite
├── examples/              # Example configs
├── artifacts/             # Generated reports (gitignored)
└── README.md

Contributing

Contributions are welcome! Areas of interest:

  • Additional provider implementations
  • Performance optimizations
  • Additional metrics and analysis
  • Documentation improvements
  • Bug reports and fixes

License

MIT License - see LICENSE file for details.


Why llmtest-perf?

llmtest-perf complements llmtest (a correctness/safety testing framework) by focusing exclusively on performance validation. Together, they provide comprehensive testing for LLM applications:

  • llmtest: Grounding, safety, prompt injection, behavioral regression
  • llmtest-perf: Latency, throughput, TTFT, performance regression

Use both for complete release validation.


Built for engineers who ship LLM systems to production.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmtest_perf-0.1.1.tar.gz (32.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llmtest_perf-0.1.1-py3-none-any.whl (31.3 kB view details)

Uploaded Python 3

File details

Details for the file llmtest_perf-0.1.1.tar.gz.

File metadata

  • Download URL: llmtest_perf-0.1.1.tar.gz
  • Upload date:
  • Size: 32.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.18

File hashes

Hashes for llmtest_perf-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c1dfa01bf818bc65533099d22747e7a4f99a35ebbaf9ea64a2959e35b99b3c30
MD5 30812d3e22df66c622050b473748aebb
BLAKE2b-256 68c55c97de025293239a8324f5e8d9cb52dbd07ca26955df22cf6736cb21571c

See more details on using hashes here.

File details

Details for the file llmtest_perf-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: llmtest_perf-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 31.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.18

File hashes

Hashes for llmtest_perf-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 619490a081905a984ad2935bdecf626f61b4c3050e53b3c780023fb3cb37ed7c
MD5 3b7c8cd422b54176e680c6b46e83e2f5
BLAKE2b-256 71a131f5d52fac97b05c2c4f6667aad3d588f4a0a23754f53a4efcfb49193a62

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page