Capacity planning for reserved LLM throughput: latency, headroom, and synthetic workload simulation.
Project description
slosizer: Right-size reserved LLM capacity based on SLOs
slosizer is a small Python package for sizing reserved LLM capacity against either a throughput objective or a latency SLO.
It takes request traces, converts them into provider-specific capacity work, simulates queueing under bursty arrivals, and tells you how many reserved units you should buy plus how much slack capacity you are likely to carry.
The package is built for the extremely normal situation where:
- you know your request shape better than your vendor calculator does,
- you care about p95 or p99 latency, not just average throughput,
- and you do not want your capacity plan to be a sacred spreadsheet that nobody trusts.
What problem does this solve?
Reserved-capacity systems like GSU/PTU are fundamentally throughput constructs, but production teams usually care about latency SLOs, burst risk, and headroom.
slosizer gives you one place to:
- ingest raw request logs into a canonical RequestTrace,
- turn requests into provider-specific adjusted work,
- plan capacity for either:
  - throughput: control overload probability or the required-unit percentile,
  - latency: satisfy p95/p99 queue-aware latency targets,
- quantify:
  - spare capacity,
  - overload probability,
  - expected overflow,
  - optimization benefit.
What does "ingest" mean here?
It just means map your raw logs into the clean columns the planner expects.
If your real data has columns like timestamp, prompt_tokens, completion_tokens, reasoning_tokens, and cache_hit_tokens, the package standardizes that into a canonical RequestTrace with:
- arrival_s
- class_name
- input_tokens
- cached_input_tokens
- output_tokens
- thinking_tokens
- max_output_tokens
- observed_latency_s
That is all. No incense. No chanting.
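For intuition, the mapping is little more than a column rename. Here is a plain-pandas sketch of what the standardization amounts to; the raw column names on the left are hypothetical, and in practice slz.from_dataframe does this for you:

```python
import pandas as pd

# Hypothetical raw export; only the left-hand column names are assumptions.
raw = pd.DataFrame({
    "timestamp": [0.0, 0.4, 1.1],
    "route": ["chat", "rag", "chat"],
    "prompt_tokens": [120, 900, 80],
    "completion_tokens": [60, 200, 40],
})

# Map raw columns onto the canonical RequestTrace names the planner expects.
canonical = raw.rename(columns={
    "timestamp": "arrival_s",
    "route": "class_name",
    "prompt_tokens": "input_tokens",
    "completion_tokens": "output_tokens",
})

print(sorted(canonical.columns))
# → ['arrival_s', 'class_name', 'input_tokens', 'output_tokens']
```

Optional columns (cached_input_tokens, thinking_tokens, and so on) map the same way when you have them.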
Quickstart
1) Create the environment with uv
```bash
uv sync --all-groups
```
2) Run the shipped synthetic demo
```bash
uv run python examples/quickstart.py
```
This writes:
- examples/output/comparison.csv
- examples/output/latency_vs_capacity.png
- examples/output/required_units_distribution.png
- examples/output/scenario_benefit.png
- examples/output/percentile_tradeoff.png
3) Run the checks
```bash
uv run pytest -q
uv run ruff check src tests examples
uv run ruff format --check src tests examples
uv run deptry .
uv run vulture
```
Install and use it on your own trace
Minimal latency-oriented example
```python
import pandas as pd

import slosizer as slz

df = pd.read_csv("requests.csv")

trace = slz.from_dataframe(
    df,
    schema=slz.RequestSchema(
        time_col="timestamp",
        class_col="route",
        input_tokens_col="prompt_tokens",
        cached_input_tokens_col="cached_prompt_tokens",
        output_tokens_col="completion_tokens",
        thinking_tokens_col="reasoning_tokens",
        max_output_tokens_col="max_output_tokens",
        latency_col="latency_s",
    ),
    provider="vertex",
    model="gemini-2.0-flash-001",
)

profile = slz.vertex_profile("gemini-2.0-flash-001")

result = slz.plan_capacity(
    trace,
    profile,
    slz.LatencyTarget(
        slz.LatencySLO(
            threshold_s=1.5,
            percentile=0.99,
            metric="e2e",
        )
    ),
)

print(result.recommended_units)
print(result.metrics)
```
Throughput-oriented example
```python
import slosizer as slz

trace = slz.make_synthetic_trace(seed=42)
profile = slz.vertex_profile("gemini-2.0-flash-001")

result = slz.plan_capacity(
    trace,
    profile,
    slz.ThroughputTarget(
        percentile=0.99,
        max_overload_probability=0.01,
        windows_s=(1.0, 5.0, 30.0),
    ),
)

print(result.recommended_units)
print(result.slack_summary)
```
Azure PTU example
Azure support is calibration-first: you seed a profile from the Azure calculator and benchmark results, then use the same planning machinery.
```python
import slosizer as slz

profile = slz.azure_profile(
    "gpt-4.1",
    throughput_per_unit=12000.0,
    input_weight=1.0,
    output_weight=4.0,
    thinking_weight=4.0,
)

# The profile then feeds the same plan_capacity(...) machinery shown in the
# Vertex examples above.
```
What data do you need?
You can start with only these three fields:
- timestamp
- input_tokens
- output_tokens
You get better plans when you also provide:
- cached_input_tokens
- thinking_tokens
- max_output_tokens
- class_name
- latency_s
See the concrete schema guide in docs/data-requirements.md.
There are also example input files in:
- examples/input/synthetic_request_trace_baseline.csv
- examples/input/synthetic_request_trace_optimized.csv
Built-in provider support
Vertex GSU
The package ships a small built-in registry for a handful of Vertex models, including:
- gemini-2.0-flash-001
- gemini-2.0-flash-lite-001
- gemini-2.5-flash
- gemini-2.5-flash-lite
- gemini-2.5-pro
- gemini-3.1-flash-lite-preview
Azure PTU
Azure PTU support is user-calibrated on purpose. The package gives you the same planning engine, but you provide the model-specific PTU profile from your calculator + benchmark loop.
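The calibration step itself reduces to one division: measure the sustained weighted token rate on a deployment of known PTU size, then divide by the PTU count to seed throughput_per_unit. A hedged sketch, where the benchmark numbers and the 50-PTU deployment size are invented for illustration:

```python
# Hypothetical saturation benchmark on a 50-PTU deployment.
ptus_deployed = 50
weights = {"input": 1.0, "output": 4.0, "thinking": 4.0}
measured_tokens_per_s = {"input": 400_000.0, "output": 40_000.0, "thinking": 10_000.0}

# Weighted tokens/s the deployment sustained at saturation.
weighted_rate = sum(weights[k] * measured_tokens_per_s[k] for k in weights)

# Per-unit throughput to pass to azure_profile(...).
throughput_per_unit = weighted_rate / ptus_deployed
print(throughput_per_unit)  # → 12000.0
```

Rerun the benchmark whenever your request shape shifts; the weights and per-unit rate are only as good as the workload they were measured on.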
See docs/provider-adapters.md.
Synthetic demo: what it shows
The repo ships with a fake but bursty workload containing three classes:
- chat
- rag
- reasoning
The optimized variant simulates:
- tighter prompts,
- more caching,
- shorter outputs,
- lower thinking-token budgets.
That lets you inspect two things immediately:
- Optimization can reduce reserved-capacity needs.
- Planning for stricter percentiles usually increases slack capacity.
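Both effects fall out of a toy windowed model. Nothing below uses slosizer internals; the gamma workload and the per-unit throughput are made-up numbers chosen only to be bursty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy adjusted work per 1s window (provider-weighted tokens).
work = rng.gamma(shape=2.0, scale=3000.0, size=3600)

unit_throughput = 10_000.0  # assumed per-unit capacity, not a vendor number

# Units needed to clear each window without queueing.
required_units = np.ceil(work / unit_throughput)

for pct in (0.95, 0.99):
    provisioned = np.quantile(required_units, pct)
    capacity = provisioned * unit_throughput
    overload_prob = float(np.mean(required_units > provisioned))
    spare_fraction = float(np.mean(1.0 - work / capacity))
    expected_overflow = float(np.mean(np.maximum(0.0, work - capacity)))
    print(pct, provisioned, round(spare_fraction, 3),
          round(overload_prob, 4), round(expected_overflow, 1))
```

Tightening the percentile from p95 to p99 never lowers the provisioned units, so average spare fraction rises with it; shrinking the work (the "optimized" scenario) pulls the whole required_units distribution down.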
Snapshot of the current synthetic outputs
| scenario | objective | target | recommended units | avg spare fraction (1s) | overload probability (1s) | achieved latency quantile |
|---|---|---|---|---|---|---|
| baseline | latency | p95 <= 1.5s | 5 | 0.718 | 0.030 | 1.315s |
| baseline | latency | p99 <= 1.5s | 7 | 0.794 | 0.006 | 1.428s |
| baseline | throughput | p99 units, overload <= 1% | 7 | 0.794 | 0.006 | - |
| optimized | latency | p95 <= 1.5s | 4 | 0.713 | 0.032 | 1.157s |
| optimized | latency | p99 <= 1.5s | 5 | 0.766 | 0.012 | 1.278s |
| optimized | throughput | p99 units, overload <= 1% | 6 | 0.804 | 0.005 | - |
These numbers are synthetic. They are there to show the mechanics, not to cosplay as your production traffic.
Output plots
- Latency vs provisioned capacity
- Distribution of required reserved units
- Optimization benefit
- Slack trade-off
Repo map
- docs/formalization.md: generic throughput/latency model
- docs/data-requirements.md: what columns you need and why
- docs/provider-adapters.md: how GSU/PTU adaptation works
- docs/examples.md: the synthetic walkthrough
- examples/quickstart.py: reproducible demo script
Caveats
- The queue model is intentionally simple: FCFS fluid queueing, not a perfect service simulator.
- Built-in Vertex profiles are text-centric. Multimodal traffic needs more columns and weights.
- Azure PTU math is workload-sensitive, so the package does not fake vendor-authoritative PTU values for you.
- If you do not have a latency column, the package falls back to a simple token-based baseline latency model. That is a starting point, not gospel.
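To make the first and last caveats concrete: a fluid FCFS queue is just a backlog recursion, and a token-based baseline latency is a fixed time-to-first-token plus decode time. Every constant below (per-unit rate, TTFT, decode speed) is an illustrative assumption, not a slosizer default:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy adjusted work per 1s window; bursty on purpose.
work = rng.gamma(shape=2.0, scale=3000.0, size=600)
capacity = 5 * 2500.0  # 5 reserved units times an assumed per-unit rate

# FCFS fluid queue: unserved work carries over as backlog.
backlog = np.zeros_like(work)
for t in range(1, len(work)):
    backlog[t] = max(0.0, backlog[t - 1] + work[t] - capacity)

# Queue delay if the backlog drains at full capacity.
queue_delay_s = backlog / capacity

# Token-based baseline fallback: fixed TTFT plus output decode time.
ttft_s, decode_tokens_per_s = 0.2, 100.0
output_tokens = 60
service_latency_s = ttft_s + output_tokens / decode_tokens_per_s  # ~0.8s

p99_e2e_s = float(np.quantile(queue_delay_s + service_latency_s, 0.99))
print(round(p99_e2e_s, 3))
```

The fluid model ignores per-request service-time variance and scheduling effects, which is exactly why it is a planning tool rather than a simulator.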
Name
The package name is slosizer because "how many units do I need, and how much empty air am I buying to hit p99?" is the real question under all the vendor jargon.
File details
Details for the file slosizer-0.2.0.tar.gz.
File metadata
- Download URL: slosizer-0.2.0.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7a933ba69fb34abb9dc54922655872d6d34ecccbc9ce427785a7f4152d1a0a24 |
| MD5 | 41500952fa3249701f0106e85b18c83e |
| BLAKE2b-256 | e3b6b9b78cc8056b667e80d123834cf93855febcabc43e184c388fd04958a61e |
Provenance
The following attestation bundles were made for slosizer-0.2.0.tar.gz:
Publisher: python-publish.yml on gojiplus/slosizer

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slosizer-0.2.0.tar.gz
- Subject digest: 7a933ba69fb34abb9dc54922655872d6d34ecccbc9ce427785a7f4152d1a0a24
- Sigstore transparency entry: 1047517055
- Sigstore integration time:
- Permalink: gojiplus/slosizer@6577bd96bcd61b54f94fc93a42e5598eda5ae28a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/gojiplus
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@6577bd96bcd61b54f94fc93a42e5598eda5ae28a
- Trigger Event: workflow_dispatch
File details
Details for the file slosizer-0.2.0-py3-none-any.whl.
File metadata
- Download URL: slosizer-0.2.0-py3-none-any.whl
- Upload date:
- Size: 24.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a77845d2756efc7bd0e9d6abeffcc33832ec9fb92ffe3ef4f2755ce514600b30 |
| MD5 | 2514be5d638376f55c480ac70232fa05 |
| BLAKE2b-256 | d7b14688bb1430f642db168fc1a40a3d386a689ff15cdb920544d87ffab563ad |
Provenance
The following attestation bundles were made for slosizer-0.2.0-py3-none-any.whl:
Publisher: python-publish.yml on gojiplus/slosizer

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: slosizer-0.2.0-py3-none-any.whl
- Subject digest: a77845d2756efc7bd0e9d6abeffcc33832ec9fb92ffe3ef4f2755ce514600b30
- Sigstore transparency entry: 1047517173
- Sigstore integration time:
- Permalink: gojiplus/slosizer@6577bd96bcd61b54f94fc93a42e5598eda5ae28a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/gojiplus
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@6577bd96bcd61b54f94fc93a42e5598eda5ae28a
- Trigger Event: workflow_dispatch