splleed

An LLM inference benchmarking harness with pluggable backends and a Python-first API.

Features

  • Python API: Write benchmarks as scripts, not config files
  • Pluggable backends: vLLM, TGI (more coming)
  • Comprehensive metrics: TTFT, ITL, TPOT, throughput, E2E latency
  • Statistical rigor: Multiple trials with confidence intervals
  • Flexible operation: Connect to existing servers or let splleed manage them

Installation

pip install splleed

For HuggingFace dataset support:

pip install splleed[hf]

Inference engines (vLLM, TGI) are not bundled; install them separately.

Quick Start

import asyncio
from splleed import Benchmark, VLLMConfig, SamplingParams

async def main():
    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct"),
        prompts=[
            "What is the capital of France?",
            "Explain quantum computing briefly.",
        ],
        concurrency=[1, 2, 4],
        trials=3,
        sampling=SamplingParams(max_tokens=100),
    ).run()

    results.print()
    results.save("results.json")

if __name__ == "__main__":
    asyncio.run(main())

Connect vs Managed Mode

In managed mode, splleed starts and stops the server for you:

backend = VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct")

In connect mode, splleed attaches to a server you already run:

backend = VLLMConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    endpoint="http://localhost:8000",
)
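Connect mode assumes the server is already up and reachable. A minimal readiness check you might run before constructing the benchmark (`wait_for_server` is a hypothetical helper, not part of splleed; vLLM's OpenAI-compatible server exposes a `/health` endpoint that returns 200 when ready):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers HTTP 200, or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True  # server is ready
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

For example, `wait_for_server("http://localhost:8000/health")` before running the benchmark avoids measuring a server that is still loading weights.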

Using HuggingFace Datasets

import asyncio

from datasets import load_dataset
from splleed import Benchmark, VLLMConfig

async def main():
    ds = load_dataset("tatsu-lab/alpaca", split="train")
    ds = ds.shuffle(seed=42).select(range(100))
    prompts = list(ds["instruction"])

    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-3B-Instruct"),
        prompts=prompts,
        concurrency=[1, 2, 4, 8],
        trials=3,
    ).run()

    results.print()

if __name__ == "__main__":
    asyncio.run(main())

Backend Configuration

vLLM

from splleed import VLLMConfig

backend = VLLMConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel=2,
    gpu_memory_utilization=0.9,
    quantization="awq",  # optional
    dtype="auto",
)

TGI

from splleed import TGIConfig

backend = TGIConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantize="bitsandbytes-nf4",  # optional
)

Benchmark Options

Benchmark(
    backend=...,
    prompts=["..."],

    # Benchmark settings
    mode="latency",          # "latency", "throughput", or "serve"
    concurrency=[1, 4, 8],   # concurrency levels to test
    warmup=2,                # warmup iterations
    runs=10,                 # requests per concurrency level
    trials=3,                # independent trials for CI
    confidence_level=0.95,   # confidence interval level

    # Sampling parameters
    sampling=SamplingParams(
        max_tokens=100,
        temperature=0.0,
        top_p=1.0,
    ),
)
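With `trials` > 1, splleed reports confidence intervals across trials at `confidence_level`. The exact method splleed uses is not documented here, but a normal-approximation interval over per-trial means can be sketched as follows (with only 2-3 trials, a t-interval would be noticeably wider):

```python
from statistics import NormalDist, mean, stdev

def confidence_interval(trial_means: list[float], level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for the mean of per-trial results."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for level=0.95
    m = mean(trial_means)
    half_width = z * stdev(trial_means) / len(trial_means) ** 0.5
    return (m - half_width, m + half_width)
```

For example, three trial means of 10.0, 12.0, and 14.0 ms give an interval of roughly 9.7 to 14.3 ms around the mean of 12.0 ms.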

Metrics

Metric       Description
TTFT         Time to first token
ITL          Inter-token latency
TPOT         Time per output token (mean ITL)
E2E          End-to-end request latency
Throughput   Output tokens per second
Goodput      Percentage of requests meeting the SLO

All latency metrics are reported as mean, p50, p95, and p99. When multiple trials are run, results also include confidence intervals (95% by default).
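To make the definitions concrete, here is an illustrative computation of these metrics from per-token arrival timestamps. This is a sketch of the definitions above, not splleed's internal implementation; timestamps are in milliseconds:

```python
import statistics

def token_latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, ITL, TPOT, and E2E latency from per-token arrival timestamps."""
    ttft = token_times[0] - request_start                        # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # gaps between tokens
    tpot = statistics.mean(itl) if itl else 0.0                  # TPOT = mean ITL
    e2e = token_times[-1] - request_start                        # end-to-end latency
    return {"ttft": ttft, "itl": itl, "tpot": tpot, "e2e": e2e}

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, as commonly reported for latency distributions."""
    ordered = sorted(values)
    rank = max(1, min(len(ordered), round(p / 100 * len(ordered))))
    return ordered[rank - 1]
```

A request starting at t=0 with tokens arriving at 200, 300, 400, and 500 ms has a TTFT of 200 ms, a TPOT of 100 ms, and an E2E latency of 500 ms.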

Output Formats

results.print()              # Rich table to console
results.save("out.json")     # JSON format
results.save("out.csv")      # CSV format

json_str = results.to_json()
csv_str = results.to_csv()

License

MIT
