A small harness for evaluating OpenAI-compatible inference endpoints with synthetic agentic workloads.

tokenspeed-trie

trie (trace replay inference evaluation) is a lightweight harness that exercises a running TokenSpeed inference endpoint with synthetic multi-turn workloads derived from production traces.

Prefill-heavy or decode-heavy synthetic benchmarks (1k/8k, 1k/1k, 8k/1k, etc.) don't capture real agentic traffic: it's multi-turn, has high per-turn prefill from tool outputs, and stresses KV-cache management as context grows. trie replays that shape.

Install

pip install tokenspeed-trie

The published package name on PyPI is tokenspeed-trie; the import name and the CLI command stay trie.

Quick start

Three-minute smoke test against a running TokenSpeed endpoint. The example model below is nvidia/Kimi-K2.5-NVFP4; substitute any served model name. Make sure the engine is idle (no leftover traffic from a previous run) before starting.

trie \
  workload_path=agentic \
  endpoint=http://localhost:8000/v1 \
  model=nvidia/Kimi-K2.5-NVFP4 \
  tokenizer_model=nvidia/Kimi-K2.5-NVFP4 \
  concurrency=8 \
  duration=180 \
  stream=True \
  num_gpus=4

workload_path accepts a short alias (agentic / qa / office) backed by the lightseekorg/trie-dataset mirror on Hugging Face, or a filesystem path to a custom JSONL. Aliases are downloaded into /tmp/trie-dataset/ on first use and reused thereafter. Override the cache directory with TRIE_DATASET_CACHE=/some/other/path if /tmp isn't usable.

model is sent to the inference endpoint. tokenizer_model is loaded separately via transformers.AutoTokenizer.from_pretrained(...) to generate synthetic prompts at the requested token lengths. Pass tokenizer_model= explicitly when model is not a valid Hugging Face ID or local tokenizer path.

stream=True is required to surface TTFT, TTFAT, and Decode TPS.

Start TokenSpeed with --enable-cache-report so the server returns usage.prompt_tokens_details.cached_tokens; without it the Cache hit rate (%) columns are zero.

Full benchmark sweep

Mirrors the methodology in Applied Compute's inference benchmark: three workloads × six concurrency levels × 2-hour runs. Total wall time: ~36 hours on a single node.

ENDPOINT=http://localhost:8000/v1
MODEL=nvidia/Kimi-K2.5-NVFP4

mkdir -p logs
for WL in agentic qa office; do
  for C in 8 16 24 32 40 48; do
    trie \
      workload_path=$WL \
      endpoint=$ENDPOINT \
      model=$MODEL \
      tokenizer_model=$MODEL \
      concurrency=$C \
      duration=7200 \
      stream=True \
      num_gpus=4 \
      2>&1 | tee logs/${WL}_c${C}.log
  done
done
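
As a sanity check on the wall-time estimate, the sweep grid can be enumerated in a few lines. This is only an illustrative sketch: `sweep_plan` and `total_hours` are hypothetical helpers, not part of trie, and the total ignores model startup and any gap between runs.

```python
from itertools import product

# Values copied from the shell loop above.
WORKLOADS = ["agentic", "qa", "office"]
CONCURRENCIES = [8, 16, 24, 32, 40, 48]
DURATION_S = 7200  # 2 hours per run


def sweep_plan():
    """Enumerate every (workload, concurrency) run and its log file name."""
    return [
        {"workload": wl, "concurrency": c, "log": f"logs/{wl}_c{c}.log"}
        for wl, c in product(WORKLOADS, CONCURRENCIES)
    ]


def total_hours(plan, duration_s=DURATION_S):
    """Lower bound on wall time: number of runs times the per-run duration."""
    return len(plan) * duration_s / 3600


plan = sweep_plan()
print(len(plan), "runs,", total_hours(plan), "hours")
```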

duration is the deadline for launching new traces. Once it elapses, the harness stops admitting work and cancels everything in flight.

Python API

from trie import Client

client = Client(
    endpoint="http://localhost:8000/v1",
    model="nvidia/Kimi-K2.5-NVFP4",
)
client.sync_run("agentic", concurrency=8, duration=180, stream=True, num_gpus=4)
# Use client.run(...) directly if you're already inside an event loop.

Workload aliases

| Alias   | HF dataset filename     | Cached path                               |
|---------|-------------------------|-------------------------------------------|
| agentic | agentic_coding_8k.jsonl | /tmp/trie-dataset/agentic_coding_8k.jsonl |
| qa      | code_qa_8k.jsonl        | /tmp/trie-dataset/code_qa_8k.jsonl        |
| office  | office_work_8k.jsonl    | /tmp/trie-dataset/office_work_8k.jsonl    |

Anything not in the alias table is treated as a filesystem path and passed through unchanged.
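
The lookup rule described above can be sketched roughly as follows. `resolve_workload_path` is a hypothetical name used for illustration and is not trie's actual implementation. The sketch only computes the expected local path and assumes the file has already been downloaded.

```python
import os
from pathlib import Path

# Alias table copied from this README; the function below is illustrative only.
ALIASES = {
    "agentic": "agentic_coding_8k.jsonl",
    "qa": "code_qa_8k.jsonl",
    "office": "office_work_8k.jsonl",
}
DEFAULT_CACHE = "/tmp/trie-dataset"


def resolve_workload_path(workload_path: str) -> Path:
    """Map a known alias to its cached file; pass anything else through as a path."""
    filename = ALIASES.get(workload_path)
    if filename is None:
        return Path(workload_path)
    # TRIE_DATASET_CACHE overrides the default cache directory.
    cache_dir = os.environ.get("TRIE_DATASET_CACHE", DEFAULT_CACHE)
    return Path(cache_dir) / filename
```

With this sketch, `resolve_workload_path("qa")` points into the cache directory, while `resolve_workload_path("./my_traces.jsonl")` is returned unchanged.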

Custom workload format

Each JSONL row defines one trace:

  • num_turns — number of tool-use turns
  • input_prompt_length — initial user prompt token length
  • assistant_response_length — per-turn assistant tokens (list of length num_turns)
  • tool_call_output_length — per-turn tool result tokens (list of length num_turns)
  • tool_call_latency — per-turn simulated delay in seconds (list of length num_turns)
  • final_assistant_response_length — final assistant response tokens after all tool turns

Example row:

{"num_turns": 2, "input_prompt_length": 32, "assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], "tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}

A trace produces num_turns + 1 completion requests: one per tool-use turn, plus a final turn after the last tool result.
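
To make the row-to-request mapping concrete, here is a small sketch that expands the example row into its completion requests. `expand_trace` is a hypothetical helper, and the context-growth rule it uses (each request's prompt is the previous prompt plus the previous assistant response plus the tool output) is an assumption that ignores chat-template overhead. The tool_call_latency values are validated but not otherwise used here.

```python
import json

LIST_FIELDS = (
    "assistant_response_length",
    "tool_call_output_length",
    "tool_call_latency",
)


def expand_trace(row: dict) -> list:
    """Return the expected token sizes of the num_turns + 1 requests of a trace."""
    n = row["num_turns"]
    for field in LIST_FIELDS:
        if len(row[field]) != n:
            raise ValueError(f"{field} must have length num_turns={n}")

    requests = []
    prompt = row["input_prompt_length"]
    for i in range(n):
        completion = row["assistant_response_length"][i]
        requests.append({"prompt_tokens": prompt, "completion_tokens": completion})
        # Assumed growth: the assistant turn and the tool result join the context.
        prompt += completion + row["tool_call_output_length"][i]

    # Final request after the last tool result.
    requests.append(
        {
            "prompt_tokens": prompt,
            "completion_tokens": row["final_assistant_response_length"],
        }
    )
    return requests


row = json.loads(
    '{"num_turns": 2, "input_prompt_length": 32, '
    '"assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], '
    '"tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}'
)
for request in expand_trace(row):
    print(request)
```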

Example output

[info     ] starting benchmark             concurrency=24 duration=300.0 model=nvidia/Kimi-K2.5-NVFP4 num_gpus=4 workload_templates=8192
[info     ] benchmark complete             completed_requests=… failed_requests=0 wall_time_s=… ...

                                                 Per-trace metrics
┏━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        ┃             ┃          ┃           ┃                    ┃                    ┃ Eligible cache hit rate ┃
┃ Metric ┃ Latency (s) ┃ TTFT (s) ┃ TTFAT (s) ┃ Decode TPS (tok/s) ┃ Cache hit rate (%) ┃                     (%) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mean   │      ...    │     ...  │      ...  │              ...   │               ... │                    ...  │
└────────┴─────────────┴──────────┴───────────┴────────────────────┴────────────────────┴─────────────────────────┘

                                     Workload metrics
                                completed=N/N  trace/s=…
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                ┃  Overall ┃ Last 30s Window ┃ Steady State ┃ Steady State / GPU ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ total prompt tok/s    │    ...   │           ...   │         ...  │              ...   │
│ cached prompt tok/s   │    ...   │           ...   │         ...  │              ...   │
│ uncached prompt tok/s │    ...   │           ...   │         ...  │              ...   │
│ completion tok/s      │    ...   │           ...   │         ...  │              ...   │
└───────────────────────┴──────────┴─────────────────┴──────────────┴────────────────────┘

Metrics

Per-trace

  • Latency (s) — end-to-end latency from the first request of a trace to the final response.
  • TTFT (s) — (streaming) time to the first streamed token of the first request.
  • TTFAT (s) — (streaming) time from trace start to the first streamed token of the final request. This is the user-visible first token in an agent that hides intermediate tool turns.
  • Decode TPS (tok/s) — (streaming) mean post-TTFT decode throughput across the trace's requests.
  • Cache hit rate (%) — server-reported cached_prompt_tokens / prompt_tokens over all requests in a trace.
  • Eligible cache hit rate (%) — same numerator, denominator restricted to prompt tokens expected to be cacheable. Excludes the initial prompt and, on each turn, the tool output newly appended on that request: sum_i cached_prompt_tokens_i / sum_i eligible_prompt_tokens_i, where eligible_prompt_tokens_0 = 0 and eligible_prompt_tokens_i = prompt_tokens_{i-1} + completion_tokens_{i-1} for i > 0.
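
The two cache hit rates can be computed from per-request usage as in the sketch below. `cache_hit_rates` and the dict keys are illustrative names, not trie's internal API. The sample numbers assume a server that caches every eligible token.

```python
def cache_hit_rates(requests):
    """Return (cache hit rate %, eligible cache hit rate %) for one trace.

    `requests` is the trace's requests in order, each a dict with
    prompt_tokens, completion_tokens and cached_prompt_tokens.
    """
    total_prompt = sum(r["prompt_tokens"] for r in requests)
    total_cached = sum(r["cached_prompt_tokens"] for r in requests)

    # eligible_0 = 0; eligible_i = prompt_{i-1} + completion_{i-1} for i > 0,
    # so every request except the last contributes to the eligible total.
    total_eligible = sum(
        prev["prompt_tokens"] + prev["completion_tokens"] for prev in requests[:-1]
    )

    hit = 100.0 * total_cached / total_prompt if total_prompt else 0.0
    eligible_hit = 100.0 * total_cached / total_eligible if total_eligible else 0.0
    return hit, eligible_hit


# Hypothetical usage for a 3-request trace with perfect prefix reuse.
usage = [
    {"prompt_tokens": 32, "completion_tokens": 16, "cached_prompt_tokens": 0},
    {"prompt_tokens": 56, "completion_tokens": 20, "cached_prompt_tokens": 48},
    {"prompt_tokens": 88, "completion_tokens": 64, "cached_prompt_tokens": 76},
]
hit, eligible_hit = cache_hit_rates(usage)
print(f"cache hit rate {hit:.1f}%, eligible cache hit rate {eligible_hit:.1f}%")
```

In this example the plain hit rate stays well below 100% because the initial prompt and fresh tool outputs can never be cached, while the eligible rate reaches 100%.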

Workload

  • trace/s — completed traces per wall-clock second.
  • total / cached / uncached prompt tok/s — aggregate prompt-token throughput, split by what the synthetic workload accounting expects to be cached vs. new.
  • completion tok/s — aggregate completion-token throughput.

Each is reported under four columns:

  • Overall — totals over the full benchmark wall time.
  • Last 30s Window — slope of cumulative token counts over the most recent 30 seconds.
  • Steady State — the headline throughput metric. Slope after dropping the first 20% of wall time as warmup. Avoids dilution from ramp-up and drain when fewer than concurrency traces are in flight. With concurrency > 1 the completion curve depends on finish order, so the metric has small run-to-run variance even at fixed seed.
  • Steady State / GPU — Steady State divided by num_gpus, when num_gpus is set.

Prompt-token throughputs use synthetic workload accounting; cache-hit metrics use server-reported usage. Divergence between the two implies a tokenizer mismatch between client and server.

Known limitations

  • Synthetic prompts are freshly random per trace, so cross-trace prefix sharing (e.g. a common system prompt or tool definitions) is not modeled and cache hit rates may be lower than in a deployment that shares prefixes.
  • Decode TPS assumes the first streamed chunk carries exactly one token. Backends that buffer multiple tokens into the first chunk overstate it slightly.
  • transformers is unpinned; install the version whose tokenizer matches your inference server's. Mismatched versions can produce subtly different token counts and cause prompt-accounting drift.
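
The Decode TPS caveat above can be illustrated with a small calculation. The per-request formula used here, (completion_tokens - 1) / (latency - TTFT), is an assumption inferred from the "exactly one token in the first chunk" wording and is not taken from trie's source. `decode_tps` is a hypothetical helper.

```python
def decode_tps(completion_tokens, ttft_s, latency_s, first_chunk_tokens=1):
    """Post-TTFT decode throughput for one request, in tokens per second.

    Assumed formula: tokens emitted after the first chunk, divided by the
    time elapsed after the first chunk arrived.
    """
    decode_time = latency_s - ttft_s
    if decode_time <= 0:
        raise ValueError("latency must exceed TTFT")
    return (completion_tokens - first_chunk_tokens) / decode_time


# Reported value: the harness assumes one token in the first chunk.
reported = decode_tps(completion_tokens=101, ttft_s=0.5, latency_s=2.5)

# If the backend actually buffered 5 tokens into the first chunk, fewer
# tokens were decoded after TTFT, so the reported value is slightly high.
actual = decode_tps(
    completion_tokens=101, ttft_s=0.5, latency_s=2.5, first_chunk_tokens=5
)
print(reported, actual)
```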
