
splleed

An LLM inference benchmarking harness with pluggable backends and a Python-first API.

Features

  • Python API: Write benchmarks as scripts, not config files
  • Pluggable backends: vLLM, TGI (more coming)
  • Comprehensive metrics: TTFT, ITL, TPOT, throughput, E2E latency
  • Statistical rigor: Multiple trials with confidence intervals
  • Flexible operation: Connect to existing servers or let splleed manage them

Installation

pip install splleed

For HuggingFace dataset support:

pip install splleed[hf]

Inference engines (vLLM, TGI) are not bundled; install them separately.
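For example, vLLM installs from PyPI, while TGI is typically deployed as a Docker container (the image tag, port mapping, and flags below follow the upstream docs and may change):

```shell
# vLLM backend (most models require a CUDA-capable GPU)
pip install vllm

# TGI is usually run via its official container image
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```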

Quick Start

import asyncio
from splleed import Benchmark, VLLMConfig, SamplingParams

async def main():
    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct"),
        prompts=[
            "What is the capital of France?",
            "Explain quantum computing briefly.",
        ],
        concurrency=[1, 2, 4],
        trials=3,
        sampling=SamplingParams(max_tokens=100),
    ).run()

    results.print()
    results.save("results.json")

if __name__ == "__main__":
    asyncio.run(main())

Connect vs Managed Mode

Managed mode - splleed starts and stops the server:

backend = VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct")

Connect mode - use an existing server:

backend = VLLMConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    endpoint="http://localhost:8000",
)

Using HuggingFace Datasets

import asyncio

from datasets import load_dataset
from splleed import Benchmark, VLLMConfig

async def main():
    ds = load_dataset("tatsu-lab/alpaca", split="train")
    ds = ds.shuffle(seed=42).select(range(100))
    prompts = list(ds["instruction"])

    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-3B-Instruct"),
        prompts=prompts,
        concurrency=[1, 2, 4, 8],
        trials=3,
    ).run()

    results.print()

if __name__ == "__main__":
    asyncio.run(main())

Backend Configuration

vLLM

from splleed import VLLMConfig

backend = VLLMConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel=2,
    gpu_memory_utilization=0.9,
    quantization="awq",  # optional
    dtype="auto",
)

TGI

from splleed import TGIConfig

backend = TGIConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantize="bitsandbytes-nf4",  # optional
)

Benchmark Modes

Latency Mode (default)

Sequential requests to measure per-request latency without interference:

Benchmark(..., mode="latency")

Throughput Mode

Concurrent requests to measure maximum throughput:

Benchmark(..., mode="throughput", concurrency=[1, 4, 8, 16])
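Conceptually, throughput mode issues requests concurrently up to a concurrency limit and divides total output tokens by wall-clock time. A minimal stdlib sketch with a stubbed backend (`fake_generate` is a stand-in, not splleed's actual runner):

```python
import asyncio
import time

async def fake_generate(prompt: str) -> int:
    """Stub backend: pretend to stream a response, return a token count."""
    await asyncio.sleep(0.01)  # stands in for model latency
    return 100  # tokens "generated"

async def measure_throughput(prompts, concurrency: int) -> float:
    """Run all prompts with at most `concurrency` in flight; return tokens/sec."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt):
        async with sem:
            return await fake_generate(prompt)

    start = time.perf_counter()
    tokens = await asyncio.gather(*(bounded(p) for p in prompts))
    elapsed = time.perf_counter() - start
    return sum(tokens) / elapsed  # output tokens per second

tput = asyncio.run(measure_throughput(["hi"] * 8, concurrency=4))
```

Sweeping `concurrency` (as splleed does with `concurrency=[1, 4, 8, 16]`) shows where throughput saturates.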

Serve Mode

Simulate realistic traffic with controlled arrival patterns:

Benchmark(
    ...,
    mode="serve",
    arrival_rate=10.0,           # 10 requests/sec
    arrival_pattern="poisson",   # realistic traffic
    concurrency=[32],            # max concurrent requests
)

Arrival patterns:

  • poisson - exponential inter-arrival times (realistic web traffic)
  • gamma - configurable burstiness
  • constant - fixed interval between requests
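As a rough illustration of how the three patterns differ, inter-arrival gaps can be drawn from the corresponding stdlib distributions. This sketch is independent of splleed's internals (the `burstiness` knob here is illustrative, not a splleed parameter):

```python
import random

def inter_arrival_times(pattern: str, rate: float, n: int, burstiness: float = 2.0):
    """Generate n inter-arrival gaps (seconds) for a target average rate (req/s)."""
    mean_gap = 1.0 / rate
    if pattern == "constant":
        # Fixed interval between requests
        return [mean_gap] * n
    if pattern == "poisson":
        # Poisson process: exponential inter-arrival times with mean 1/rate
        return [random.expovariate(rate) for _ in range(n)]
    if pattern == "gamma":
        # Shape < 1 gives burstier traffic; the mean gap stays at 1/rate
        shape = 1.0 / burstiness
        return [random.gammavariate(shape, mean_gap / shape) for _ in range(n)]
    raise ValueError(f"unknown pattern: {pattern}")

random.seed(0)
gaps = inter_arrival_times("poisson", rate=10.0, n=10_000)
mean_gap = sum(gaps) / len(gaps)  # empirically close to 0.1 s at 10 req/s
```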

Benchmark Options

Benchmark(
    backend=...,
    prompts=["..."],

    # Benchmark settings
    mode="latency",          # "latency", "throughput", or "serve"
    concurrency=[1, 4, 8],   # concurrency levels to test
    warmup=2,                # warmup iterations
    runs=10,                 # requests per concurrency level
    trials=3,                # independent trials for CI
    confidence_level=0.95,   # confidence interval level

    # Serve mode only
    arrival_rate=10.0,       # requests per second
    arrival_pattern="poisson",  # "poisson", "gamma", "constant"

    # Sampling parameters
    sampling=SamplingParams(
        max_tokens=100,
        temperature=0.0,
        top_p=1.0,
    ),
)

Metrics

Metric       Description
TTFT         Time to first token
ITL          Inter-token latency
TPOT         Time per output token (mean ITL)
E2E          End-to-end request latency
Throughput   Tokens/sec
Goodput      % of requests meeting SLO

All latency metrics report mean, p50, p95, and p99. With multiple trials, results also include confidence intervals (95% by default, configurable via confidence_level).
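The aggregation above can be sketched with the stdlib: percentiles over per-request samples, and a confidence interval over per-trial means. The formulas are standard (nearest-rank percentiles, normal-approximation CI) and the numbers are illustrative, not necessarily splleed's exact implementation:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

# Per-request TTFT samples from one trial (illustrative numbers, seconds)
ttft = [0.08, 0.09, 0.10, 0.11, 0.12, 0.15, 0.18, 0.22, 0.30, 0.45]
summary = {
    "mean": statistics.mean(ttft),
    "p50": percentile(ttft, 50),
    "p95": percentile(ttft, 95),
    "p99": percentile(ttft, 99),
}

# 95% CI over per-trial means, using a normal approximation (z = 1.96)
trial_means = [0.14, 0.16, 0.15]
m = statistics.mean(trial_means)
half = 1.96 * statistics.stdev(trial_means) / len(trial_means) ** 0.5
ci = (m - half, m + half)
```

With only 3 trials, a t-based interval would be wider; the normal approximation keeps the sketch short.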

Output Formats

results.print()              # Rich table to console
results.save("out.json")     # JSON format
results.save("out.csv")      # CSV format

json_str = results.to_json()
csv_str = results.to_csv()

License

MIT
