# splleed

LLM inference benchmarking with a Python-first API.
## Features
- Python API: Write benchmarks as scripts, not config files
- Pluggable backends: vLLM, TGI (more coming)
- Comprehensive metrics: TTFT, ITL, TPOT, throughput, E2E latency
- Statistical rigor: Multiple trials with confidence intervals
- Flexible operation: Connect to existing servers or let splleed manage them
## Installation

```bash
pip install splleed
```

For HuggingFace dataset support:

```bash
pip install "splleed[hf]"
```

Inference engines (vLLM, TGI) are not bundled; install them separately.
## Quick Start

```python
import asyncio

from splleed import Benchmark, VLLMConfig, SamplingParams

async def main():
    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct"),
        prompts=[
            "What is the capital of France?",
            "Explain quantum computing briefly.",
        ],
        concurrency=[1, 2, 4],
        trials=3,
        sampling=SamplingParams(max_tokens=100),
    ).run()

    results.print()
    results.save("results.json")

if __name__ == "__main__":
    asyncio.run(main())
```
## Connect vs Managed Mode

**Managed mode** - splleed starts and stops the server:

```python
backend = VLLMConfig(model="Qwen/Qwen2.5-0.5B-Instruct")
```

**Connect mode** - use an existing server:

```python
backend = VLLMConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    endpoint="http://localhost:8000",
)
```
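In connect mode it can help to verify the endpoint is reachable before launching a long benchmark. A minimal stdlib-only sketch (the `/health` route is an assumption about your deployment; substitute whatever path your server exposes):

```python
import urllib.request
import urllib.error

def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers any HTTP response at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server responded (even with a 4xx/5xx status), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        return False

# Hypothetical usage before a connect-mode run:
# if not server_is_up("http://localhost:8000/health"):
#     raise SystemExit("no server at :8000 - start one or use managed mode")
```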
## Using HuggingFace Datasets

```python
import asyncio

from datasets import load_dataset
from splleed import Benchmark, VLLMConfig

async def main():
    ds = load_dataset("tatsu-lab/alpaca", split="train")
    ds = ds.shuffle(seed=42).select(range(100))
    prompts = list(ds["instruction"])

    results = await Benchmark(
        backend=VLLMConfig(model="Qwen/Qwen2.5-3B-Instruct"),
        prompts=prompts,
        concurrency=[1, 2, 4, 8],
        trials=3,
    ).run()

    results.print()

if __name__ == "__main__":
    asyncio.run(main())
```
## Backend Configuration

### vLLM

```python
from splleed import VLLMConfig

backend = VLLMConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel=2,
    gpu_memory_utilization=0.9,
    quantization="awq",  # optional
    dtype="auto",
)
```

### TGI

```python
from splleed import TGIConfig

backend = TGIConfig(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantize="bitsandbytes-nf4",  # optional
)
```
## Benchmark Modes

### Latency Mode (default)

Sequential requests to measure per-request latency without interference:

```python
Benchmark(..., mode="latency")
```

### Throughput Mode

Concurrent requests to measure maximum throughput:

```python
Benchmark(..., mode="throughput", concurrency=[1, 4, 8, 16])
```

### Serve Mode

Simulate realistic traffic with controlled arrival patterns:

```python
Benchmark(
    ...,
    mode="serve",
    arrival_rate=10.0,          # 10 requests/sec
    arrival_pattern="poisson",  # realistic traffic
    concurrency=[32],           # max concurrent requests
)
```
Arrival patterns:

- `poisson` - exponential inter-arrival times (realistic web traffic)
- `gamma` - configurable burstiness
- `constant` - fixed interval between requests
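To build intuition for the `poisson` pattern: a Poisson arrival process at rate λ has exponentially distributed gaps between requests with mean 1/λ. A small sketch, independent of splleed, illustrating why `arrival_rate=10.0` averages ten requests per second while still producing bursts:

```python
import random

def poisson_gaps(rate_per_sec: float, n: int, seed: int = 0) -> list[float]:
    """Sample n inter-arrival gaps (seconds) for a Poisson process at the given rate."""
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_sec) for _ in range(n)]

gaps = poisson_gaps(rate_per_sec=10.0, n=100_000)
mean_gap = sum(gaps) / len(gaps)
# mean_gap approaches 1 / rate = 0.1 s, i.e. ~10 requests/sec on average,
# but individual gaps vary widely - that burstiness is what stresses a server.
```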
## Benchmark Options

```python
Benchmark(
    backend=...,
    prompts=["..."],

    # Benchmark settings
    mode="latency",             # "latency", "throughput", or "serve"
    concurrency=[1, 4, 8],      # concurrency levels to test
    warmup=2,                   # warmup iterations
    runs=10,                    # requests per concurrency level
    trials=3,                   # independent trials for CI
    confidence_level=0.95,      # confidence interval level

    # Serve mode only
    arrival_rate=10.0,          # requests per second
    arrival_pattern="poisson",  # "poisson", "gamma", "constant"

    # Sampling parameters
    sampling=SamplingParams(
        max_tokens=100,
        temperature=0.0,
        top_p=1.0,
    ),
)
```
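These knobs multiply. Reading `runs` as requests per concurrency level and `trials` as independent repetitions (my reading of the options above), the measured workload is `trials × len(concurrency) × runs` requests, plus warmup. A quick sketch for budgeting benchmark time, assuming warmup runs once per level per trial (an assumption about splleed's scheduling, not confirmed by the docs):

```python
def total_requests(concurrency: list[int], runs: int, trials: int, warmup: int) -> int:
    """Total requests issued, assuming warmup repeats at every level in every trial."""
    levels = len(concurrency)
    return trials * levels * (runs + warmup)

n = total_requests(concurrency=[1, 4, 8], runs=10, trials=3, warmup=2)
# 3 trials x 3 levels x (10 + 2) = 108 requests
```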
## Metrics
| Metric | Description |
|---|---|
| TTFT | Time to first token |
| ITL | Inter-token latency |
| TPOT | Time per output token (mean ITL) |
| E2E | End-to-end request latency |
| Throughput | Tokens/sec |
| Goodput | % of requests meeting SLO |
All latency metrics report p50, p95, p99, and mean. With multiple trials, results include confidence intervals at the configured `confidence_level` (95% by default).
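For reference, a confidence interval across trials can be computed from the per-trial means of a metric. A minimal stdlib sketch using the normal approximation (splleed's exact method may differ, e.g. a t-distribution, which is wider for very few trials):

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(trial_means: list[float], level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation CI for the mean of per-trial metric values."""
    m = mean(trial_means)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for 95%
    half = z * stdev(trial_means) / math.sqrt(len(trial_means))
    return (m - half, m + half)

# e.g. throughput in tokens/sec across 3 independent trials
lo, hi = confidence_interval([41.2, 43.8, 42.5])
```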
## Output Formats

```python
results.print()           # Rich table to console
results.save("out.json")  # JSON format
results.save("out.csv")   # CSV format

json_str = results.to_json()
csv_str = results.to_csv()
```
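Because the saved results are plain JSON, they can be post-processed with standard tools. A sketch, assuming a hypothetical schema of one record per concurrency level with a `throughput` field (this shape is illustrative only; inspect the actual file before relying on it):

```python
import json

# Illustrative records in the assumed shape; in practice use
# records = json.load(open("out.json")) on a saved results file.
records = json.loads("""[
    {"concurrency": 1, "throughput": 35.0},
    {"concurrency": 4, "throughput": 98.0},
    {"concurrency": 8, "throughput": 121.0}
]""")

best = max(records, key=lambda r: r["throughput"])
print(f"best throughput {best['throughput']} tok/s at concurrency {best['concurrency']}")
```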
## License

MIT