A small harness for evaluating OpenAI-compatible inference endpoints with synthetic agentic workloads.
tokenspeed-trie
trie (trace replay inference evaluation) is a lightweight harness that exercises a running TokenSpeed inference endpoint with synthetic multi-turn workloads derived from production traces.
Prefill-heavy or decode-heavy synthetic benchmarks (1k/8k, 1k/1k, 8k/1k, etc.) don't capture real agentic traffic, which is multi-turn, has high per-turn prefill from tool outputs, and stresses KV-cache management as context grows. trie replays that shape.
Install
pip install tokenspeed-trie
The published package name on PyPI is tokenspeed-trie; the import name and the CLI command stay trie.
Quick start
Three-minute smoke test against a running TokenSpeed endpoint. The example model below is nvidia/Kimi-K2.5-NVFP4; substitute any served model name. Make sure the engine is idle (no leftover traffic from a previous run) before starting.
trie \
  workload_path=agentic \
  endpoint=http://localhost:8000/v1 \
  model=nvidia/Kimi-K2.5-NVFP4 \
  tokenizer_model=nvidia/Kimi-K2.5-NVFP4 \
  concurrency=8 \
  duration=180 \
  stream=True \
  num_gpus=4
workload_path accepts a short alias (agentic / qa / office) backed by the lightseekorg/trie-dataset mirror on Hugging Face, or a filesystem path to a custom JSONL. Aliases are downloaded into /tmp/trie-dataset/ on first use and reused thereafter. Override the cache directory with TRIE_DATASET_CACHE=/some/other/path if /tmp isn't usable.
model is sent to the inference endpoint. tokenizer_model is loaded separately via transformers.AutoTokenizer.from_pretrained(...) to generate synthetic prompts at the requested token lengths. Pass tokenizer_model= explicitly when model is not a valid Hugging Face ID or local tokenizer path.
stream=True is required to surface TTFT, TTFAT, and Decode TPS.
Start TokenSpeed with --enable-cache-report so the server returns usage.prompt_tokens_details.cached_tokens; without it the Cache hit rate (%) columns are zero.
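As a minimal sketch of where that field lives, the snippet below reads usage.prompt_tokens_details.cached_tokens from a usage payload shaped like an OpenAI-compatible response. The helper name cached_prompt_tokens and both sample payloads are illustrative assumptions, not part of trie or captured from a real server.

```python
def cached_prompt_tokens(usage: dict) -> int:
    """Return usage.prompt_tokens_details.cached_tokens, or 0 when it is absent."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Illustrative payloads (assumed shapes, not real server output).
with_report = {
    "prompt_tokens": 88,
    "completion_tokens": 64,
    "prompt_tokens_details": {"cached_tokens": 76},
}
without_report = {"prompt_tokens": 88, "completion_tokens": 64}

print(cached_prompt_tokens(with_report))     # 76
print(cached_prompt_tokens(without_report))  # 0
```

A server started without the cache report looks like the second payload, which is why the cache hit rate columns read as zero in that case.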
Full benchmark sweep
Mirrors the methodology in Applied Compute's inference benchmark: three workloads × six concurrency levels × 2-hour runs. Total wall time: ~36 hours on a single node.
ENDPOINT=http://localhost:8000/v1
MODEL=nvidia/Kimi-K2.5-NVFP4
mkdir -p logs
for WL in agentic qa office; do
  for C in 8 16 24 32 40 48; do
    trie \
      workload_path="$WL" \
      endpoint="$ENDPOINT" \
      model="$MODEL" \
      tokenizer_model="$MODEL" \
      concurrency="$C" \
      duration=7200 \
      stream=True \
      num_gpus=4 \
      2>&1 | tee "logs/${WL}_c${C}.log"
  done
done
duration is the deadline for launching new traces. Once it elapses, the harness stops admitting work and cancels everything in flight.
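The quoted wall time follows directly from the sweep grid. A small sketch of the arithmetic, using only the values from the loop above (the variable names are illustrative):

```python
workloads = ["agentic", "qa", "office"]
concurrency_levels = [8, 16, 24, 32, 40, 48]
duration_s = 7200  # 2 hours per run

# One run per (workload, concurrency) pair, executed back to back.
runs = [(wl, c) for wl in workloads for c in concurrency_levels]
total_hours = len(runs) * duration_s / 3600

# Log file names match the tee target used in the shell loop.
log_files = [f"logs/{wl}_c{c}.log" for wl, c in runs]

print(len(runs), total_hours)  # 18 36.0
print(log_files[0])            # logs/agentic_c8.log
```

This counts only the benchmark durations; any per-run startup or dataset download time comes on top.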
Python API
from trie import Client

client = Client(
    endpoint="http://localhost:8000/v1",
    model="nvidia/Kimi-K2.5-NVFP4",
)
client.sync_run("agentic", concurrency=8, duration=180, stream=True, num_gpus=4)
# Use client.run(...) directly if you're already inside an event loop.
Workload aliases
| Alias | HF dataset filename | Cached path |
|---|---|---|
| agentic | agentic_coding_8k.jsonl | /tmp/trie-dataset/agentic_coding_8k.jsonl |
| qa | code_qa_8k.jsonl | /tmp/trie-dataset/code_qa_8k.jsonl |
| office | office_work_8k.jsonl | /tmp/trie-dataset/office_work_8k.jsonl |
Anything not in the alias table is treated as a filesystem path and passed through unchanged.
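The lookup rule can be sketched in a few lines. This is not trie's actual implementation: resolve_workload_path is a hypothetical helper, it only computes the expected local path, and it does not perform the Hugging Face download. The alias filenames, the default cache directory, and the TRIE_DATASET_CACHE override are taken from this README.

```python
import os
from pathlib import Path

# Alias -> dataset filename, copied from the alias table.
ALIASES = {
    "agentic": "agentic_coding_8k.jsonl",
    "qa": "code_qa_8k.jsonl",
    "office": "office_work_8k.jsonl",
}


def resolve_workload_path(workload_path: str) -> Path:
    """Hypothetical helper: map an alias to its cached file, else pass the path through."""
    cache_dir = Path(os.environ.get("TRIE_DATASET_CACHE", "/tmp/trie-dataset"))
    if workload_path in ALIASES:
        return cache_dir / ALIASES[workload_path]
    return Path(workload_path)


print(resolve_workload_path("qa"))
print(resolve_workload_path("./my_traces.jsonl"))
```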
Custom workload format
Each JSONL row defines one trace:
- num_turns: number of tool-use turns
- input_prompt_length: initial user prompt token length
- assistant_response_length: per-turn assistant tokens (list of length num_turns)
- tool_call_output_length: per-turn tool result tokens (list of length num_turns)
- tool_call_latency: per-turn simulated delay in seconds (list of length num_turns)
- final_assistant_response_length: final assistant response tokens after all tool turns
Example row:
{"num_turns": 2, "input_prompt_length": 32, "assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], "tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}
A trace produces num_turns + 1 completion requests: one per tool-use turn, plus a final turn after the last tool result.
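Before pointing trie at a custom JSONL, it can help to check each row against the field list above. The sketch below is an assumption about what a reasonable check looks like, not trie's own validation: validate_trace is a hypothetical helper that verifies the per-turn lists have num_turns entries and returns the number of completion requests the trace would produce.

```python
import json

# Fields that must be lists with one entry per tool-use turn.
PER_TURN_FIELDS = (
    "assistant_response_length",
    "tool_call_output_length",
    "tool_call_latency",
)


def validate_trace(row: dict) -> int:
    """Check one trace row; return the number of completion requests it produces."""
    num_turns = row["num_turns"]
    for field in PER_TURN_FIELDS:
        if len(row[field]) != num_turns:
            raise ValueError(
                f"{field} must have {num_turns} entries, got {len(row[field])}"
            )
    # These two scalar fields must be present as well.
    for field in ("input_prompt_length", "final_assistant_response_length"):
        if field not in row:
            raise ValueError(f"missing field: {field}")
    # One request per tool-use turn, plus the final turn.
    return num_turns + 1


line = (
    '{"num_turns": 2, "input_prompt_length": 32, '
    '"assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], '
    '"tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}'
)
print(validate_trace(json.loads(line)))  # 3
```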
Example output
[info ] starting benchmark concurrency=24 duration=300.0 model=nvidia/Kimi-K2.5-NVFP4 num_gpus=4 workload_templates=8192
[info ] benchmark complete completed_requests=… failed_requests=0 wall_time_s=… ...
Per-trace metrics
┏━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┃ ┃ ┃ ┃ ┃ ┃ Eligible cache hit rate ┃
┃ Metric ┃ Latency (s) ┃ TTFT (s) ┃ TTFAT (s) ┃ Decode TPS (tok/s) ┃ Cache hit rate (%) ┃ (%) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mean │ ... │ ... │ ... │ ... │ ... │ ... │
└────────┴─────────────┴──────────┴───────────┴────────────────────┴────────────────────┴─────────────────────────┘
Workload metrics
completed=N/N trace/s=…
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Overall ┃ Last 30s Window ┃ Steady State ┃ Steady State / GPU ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ total prompt tok/s │ ... │ ... │ ... │ ... │
│ cached prompt tok/s │ ... │ ... │ ... │ ... │
│ uncached prompt tok/s │ ... │ ... │ ... │ ... │
│ completion tok/s │ ... │ ... │ ... │ ... │
└───────────────────────┴──────────┴─────────────────┴──────────────┴────────────────────┘
Metrics
Per-trace
- Latency (s): end-to-end latency from the first request of a trace to the final response.
- TTFT (s): (streaming) time to the first streamed token of the first request.
- TTFAT (s): (streaming) time from trace start to the first streamed token of the final request. The user-visible first token in an agent that hides intermediate tool turns.
- Decode TPS (tok/s): (streaming) mean post-TTFT decode throughput across the trace's requests.
- Cache hit rate (%): server-reported cached_prompt_tokens / prompt_tokens over all requests in a trace.
- Eligible cache hit rate (%): same numerator, denominator restricted to prompt tokens expected to be cacheable. Excludes the initial prompt and, on each turn, the tool output newly appended on that request: sum_i cached_prompt_tokens_i / sum_i eligible_prompt_tokens_i, where eligible_prompt_tokens_0 = 0 and eligible_prompt_tokens_i = prompt_tokens_{i-1} + completion_tokens_{i-1} for i > 0.
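The two cache hit rates can be worked through on a small trace. The sketch below applies the formulas above to the example row from the custom workload section, assuming no chat-template overhead and a server that caches every previously seen token; both are simplifying assumptions, and cache_hit_rates is an illustrative helper rather than trie's code.

```python
def cache_hit_rates(requests):
    """requests: ordered list of (prompt_tokens, completion_tokens, cached_prompt_tokens).

    Returns (cache hit rate %, eligible cache hit rate %).
    """
    total_prompt = sum(p for p, _, _ in requests)
    total_cached = sum(c for _, _, c in requests)
    # eligible_0 = 0; eligible_i = prompt_{i-1} + completion_{i-1} for i > 0,
    # so the sum runs over every request except the last.
    total_eligible = sum(p + o for p, o, _ in requests[:-1])
    hit = 100.0 * total_cached / total_prompt if total_prompt else 0.0
    eligible_hit = 100.0 * total_cached / total_eligible if total_eligible else 0.0
    return hit, eligible_hit


# Example row: input 32, assistant [16, 20], tool output [8, 12], final 64.
# Prompts grow as 32 -> 32+16+8 = 56 -> 56+20+12 = 88.
requests = [
    (32, 16, 0),   # nothing cached on the first request
    (56, 20, 48),  # previous prompt + completion served from cache
    (88, 64, 76),
]
hit, eligible_hit = cache_hit_rates(requests)
print(round(hit, 2), eligible_hit)  # 70.45 100.0
```

The gap between the two numbers is the initial prompt and the newly appended tool outputs, which no cache could have served.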
Workload
- trace/s: completed traces per wall-clock second.
- total / cached / uncached prompt tok/s: aggregate prompt-token throughput, split by what the synthetic workload accounting expects to be cached vs. new.
- completion tok/s: aggregate completion-token throughput.
Each is reported under four columns:
- Overall: totals over the full benchmark wall time.
- Last 30s Window: slope of cumulative token counts over the most recent 30 seconds.
- Steady State: the headline throughput metric. Slope after dropping the first 20% of wall time as warmup. Avoids dilution from ramp-up and drain when fewer than concurrency traces are in flight. With concurrency > 1 the completion curve depends on finish order, so the metric has small run-to-run variance even at fixed seed.
- Steady State / GPU: Steady State / num_gpus when num_gpus is set.
Prompt-token throughputs use synthetic workload accounting; cache-hit metrics use server-reported usage. Divergence implies a tokenizer mismatch between client and server.
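To illustrate why the Steady State column differs from Overall, the sketch below computes a slope over cumulative token counts after discarding the first 20% of wall time. The README does not say how the slope is estimated, so the least-squares fit here is an assumption, and steady_state_rate and the sample data are illustrative only.

```python
def steady_state_rate(samples, warmup_fraction=0.2):
    """samples: (elapsed_seconds, cumulative_tokens) pairs sorted by time.

    Returns the least-squares slope (tok/s) over samples after the warmup cut.
    """
    wall_time = samples[-1][0]
    kept = [(t, n) for t, n in samples if t >= warmup_fraction * wall_time]
    if len(kept) < 2:
        raise ValueError("need at least two samples after warmup")
    mean_t = sum(t for t, _ in kept) / len(kept)
    mean_n = sum(n for _, n in kept) / len(kept)
    cov = sum((t - mean_t) * (n - mean_n) for t, n in kept)
    var = sum((t - mean_t) ** 2 for t, _ in kept)
    if var == 0:
        raise ValueError("samples after warmup share one timestamp")
    return cov / var


# Synthetic run: 50 tok/s while ramping up (first 20 s), then 100 tok/s.
samples = [
    (t, 50 * t if t <= 20 else 1000 + 100 * (t - 20)) for t in range(0, 101, 10)
]
overall = samples[-1][1] / samples[-1][0]
steady = steady_state_rate(samples)
num_gpus = 4
print(overall, steady, steady / num_gpus)
```

In this synthetic run the overall rate is pulled down to 90 tok/s by the slow first 20 seconds, while the post-warmup slope recovers the 100 tok/s the system sustains.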
Known limitations
- Synthetic prompts are freshly random per trace, so cross-trace prefix sharing (e.g. a common system prompt or tool definitions) is not modeled and cache hit rates may be lower than in a deployment that shares prefixes.
- Decode TPS assumes the first streamed chunk carries exactly one token. Backends that buffer multiple tokens into the first chunk overstate it slightly.
- transformers is unpinned; install the version whose tokenizer matches your inference server's. Mismatched versions can produce subtly different token counts and cause prompt-accounting drift.
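The size of the Decode TPS overstatement can be estimated with a short calculation. The formula below, tokens after the first chunk divided by time after TTFT, is an assumption consistent with the description of the metric, not a copy of trie's code; decode_tps and the numbers are illustrative.

```python
def decode_tps(completion_tokens, ttft_s, latency_s, first_chunk_tokens=1):
    """Post-TTFT decode throughput for a single request (assumed formula)."""
    decode_time = latency_s - ttft_s
    if decode_time <= 0:
        raise ValueError("latency must exceed TTFT")
    return (completion_tokens - first_chunk_tokens) / decode_time


# 101 completion tokens, first chunk at 0.5 s, last chunk at 2.5 s.
reported = decode_tps(101, ttft_s=0.5, latency_s=2.5)
# Same request if the backend actually packed 5 tokens into the first chunk.
actual = decode_tps(101, ttft_s=0.5, latency_s=2.5, first_chunk_tokens=5)
print(reported, actual)  # 50.0 48.0
```

The error shrinks as responses get longer, since the extra first-chunk tokens become a smaller share of the completion.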