A small harness for evaluating OpenAI-compatible inference endpoints with synthetic agentic workloads.

tokenspeed-trie

trie (trace replay inference evaluation) is a lightweight harness that exercises a running TokenSpeed inference endpoint with synthetic multi-turn workloads derived from production traces.

Prefill-heavy or decode-heavy synthetic benchmarks (1k/8k, 1k/1k, 8k/1k, etc.) don't capture real agentic traffic: it's multi-turn, has high per-turn prefill from tool outputs, and stresses KV-cache management as context grows. trie replays that shape.

Install

pip install tokenspeed-trie

The published package name on PyPI is tokenspeed-trie; the import name and the CLI command stay trie.

Quick start

Three-minute smoke test against a running TokenSpeed endpoint. The example model below is nvidia/Kimi-K2.5-NVFP4; substitute any served model name. Make sure the engine is idle (no leftover traffic from a previous run) before starting.

trie \
  workload_path=agentic \
  endpoint=http://localhost:8000/v1 \
  model=nvidia/Kimi-K2.5-NVFP4 \
  tokenizer_model=nvidia/Kimi-K2.5-NVFP4 \
  concurrency=8 \
  duration=180 \
  stream=True \
  num_gpus=4

workload_path accepts a short alias (agentic / qa / office) backed by the lightseekorg/trie-dataset mirror on Hugging Face, or a filesystem path to a custom JSONL. Aliases are downloaded into /tmp/trie-dataset/ on first use and reused thereafter. Override the cache directory with TRIE_DATASET_CACHE=/some/other/path if /tmp isn't usable.

model is the model name sent to the inference endpoint. tokenizer_model is loaded separately via transformers.AutoTokenizer.from_pretrained(...) and is used to generate synthetic prompts at the requested token lengths. Pass tokenizer_model= explicitly when model is not a valid Hugging Face ID or local tokenizer path.

stream=True is required to surface TTFT, TTFAT, and Decode TPS.

Start TokenSpeed with --enable-cache-report so the server returns usage.prompt_tokens_details.cached_tokens; without it, the Cache hit rate (%) and Eligible cache hit rate (%) columns both report zero.

Full benchmark sweep

The sweep below mirrors the methodology in Applied Compute's inference benchmark: three workloads × six concurrency levels × 2-hour runs. Total wall time: about 36 hours (3 × 6 × 2 h) on a single node.

ENDPOINT=http://localhost:8000/v1
MODEL=nvidia/Kimi-K2.5-NVFP4

mkdir -p logs
for WL in agentic qa office; do
  for C in 8 16 24 32 40 48; do
    trie \
      workload_path=$WL \
      endpoint=$ENDPOINT \
      model=$MODEL \
      tokenizer_model=$MODEL \
      concurrency=$C \
      duration=7200 \
      stream=True \
      num_gpus=4 \
      2>&1 | tee logs/${WL}_c${C}.log
  done
done

duration (in seconds) is the deadline for launching new traces. Once it elapses, the harness stops admitting work and cancels everything in flight.
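The 36-hour estimate follows directly from the sweep grid. As a rough sketch, the Python below enumerates the same 18 runs as the shell loop and totals their durations. The helper name sweep_commands is hypothetical and not part of trie; it only builds argument lists and does not launch anything.

```python
from itertools import product

WORKLOADS = ("agentic", "qa", "office")
CONCURRENCIES = (8, 16, 24, 32, 40, 48)
DURATION_S = 7200  # 2-hour runs, as in the shell loop


def sweep_commands(endpoint, model, num_gpus=4):
    """Hypothetical helper: build one trie argument list per (workload, concurrency)."""
    commands = []
    for workload, concurrency in product(WORKLOADS, CONCURRENCIES):
        commands.append([
            "trie",
            f"workload_path={workload}",
            f"endpoint={endpoint}",
            f"model={model}",
            f"tokenizer_model={model}",
            f"concurrency={concurrency}",
            f"duration={DURATION_S}",
            "stream=True",
            f"num_gpus={num_gpus}",
        ])
    return commands


commands = sweep_commands("http://localhost:8000/v1", "nvidia/Kimi-K2.5-NVFP4")
total_hours = len(commands) * DURATION_S / 3600
print(len(commands), total_hours)  # 18 36.0
```

Each list could be handed to subprocess.run to execute the sweep from Python instead of the shell.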

Python API

from trie import Client

client = Client(
    endpoint="http://localhost:8000/v1",
    model="nvidia/Kimi-K2.5-NVFP4",
)
client.sync_run("agentic", concurrency=8, duration=180, stream=True, num_gpus=4)
# Use client.run(...) directly if you're already inside an event loop.

Workload aliases

Alias	HF dataset filename	Cached path
`agentic`	`agentic_coding_8k.jsonl`	`/tmp/trie-dataset/agentic_coding_8k.jsonl`
`qa`	`code_qa_8k.jsonl`	`/tmp/trie-dataset/code_qa_8k.jsonl`
`office`	`office_work_8k.jsonl`	`/tmp/trie-dataset/office_work_8k.jsonl`

Anything not in the alias table is treated as a filesystem path and passed through unchanged.
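The resolution rule above can be sketched in a few lines. This is an illustration under stated assumptions, not trie's actual code: the function name resolve_workload_path and its env parameter are hypothetical, and the first-use download from Hugging Face is left out.

```python
import os

# Alias table copied from the README.
ALIASES = {
    "agentic": "agentic_coding_8k.jsonl",
    "qa": "code_qa_8k.jsonl",
    "office": "office_work_8k.jsonl",
}
DEFAULT_CACHE_DIR = "/tmp/trie-dataset"


def resolve_workload_path(workload_path, env=None):
    """Hypothetical helper: map an alias to its cached file, else pass the path through."""
    env = os.environ if env is None else env
    filename = ALIASES.get(workload_path)
    if filename is None:
        # Not an alias: treated as a filesystem path and returned unchanged.
        return workload_path
    cache_dir = env.get("TRIE_DATASET_CACHE", DEFAULT_CACHE_DIR)
    return os.path.join(cache_dir, filename)


print(resolve_workload_path("qa", env={}))
print(resolve_workload_path("qa", env={"TRIE_DATASET_CACHE": "/data/cache"}))
print(resolve_workload_path("./my_traces.jsonl", env={}))
```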

Custom workload format

Each JSONL row defines one trace:

num_turns — number of tool-use turns
input_prompt_length — initial user prompt token length
assistant_response_length — per-turn assistant tokens (list of length num_turns)
tool_call_output_length — per-turn tool result tokens (list of length num_turns)
tool_call_latency — per-turn simulated delay in seconds (list of length num_turns)
final_assistant_response_length — final assistant response tokens after all tool turns

Example row:

{"num_turns": 2, "input_prompt_length": 32, "assistant_response_length": [16, 20], "tool_call_output_length": [8, 12], "tool_call_latency": [0.0, 0.0], "final_assistant_response_length": 64}

A trace produces num_turns + 1 completion requests: one per tool-use turn, plus a final turn after the last tool result.
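To make the request count concrete, the sketch below expands one row into its planned requests. It assumes each request's prompt is the previous prompt plus the previous assistant response plus the tool output that followed it, and it ignores chat-template overhead and tool_call_latency; the name planned_requests is hypothetical.

```python
def planned_requests(row):
    """Return (prompt_tokens, completion_tokens) for each request in a trace."""
    completions = list(row["assistant_response_length"])
    completions.append(row["final_assistant_response_length"])
    tool_outputs = list(row["tool_call_output_length"])

    requests = []
    prompt = row["input_prompt_length"]
    for i, completion in enumerate(completions):
        requests.append((prompt, completion))
        if i < len(tool_outputs):
            # Context grows by the assistant reply and the tool result.
            prompt += completion + tool_outputs[i]
    return requests


example_row = {
    "num_turns": 2,
    "input_prompt_length": 32,
    "assistant_response_length": [16, 20],
    "tool_call_output_length": [8, 12],
    "tool_call_latency": [0.0, 0.0],
    "final_assistant_response_length": 64,
}
plan = planned_requests(example_row)
print(plan)  # [(32, 16), (56, 20), (88, 64)]
```

The example row yields three requests for num_turns = 2, with the prompt growing from 32 to 88 tokens.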

Example output

[info     ] starting benchmark             concurrency=24 duration=300.0 model=nvidia/Kimi-K2.5-NVFP4 num_gpus=4 workload_templates=8192
[info     ] benchmark complete             completed_requests=… failed_requests=0 wall_time_s=… ...

                                                 Per-trace metrics
┏━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        ┃             ┃          ┃           ┃                    ┃                    ┃ Eligible cache hit rate ┃
┃ Metric ┃ Latency (s) ┃ TTFT (s) ┃ TTFAT (s) ┃ Decode TPS (tok/s) ┃ Cache hit rate (%) ┃                     (%) ┃
┡━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mean   │      ...    │     ...  │      ...  │              ...   │               ... │                    ...  │
└────────┴─────────────┴──────────┴───────────┴────────────────────┴────────────────────┴─────────────────────────┘

                                     Workload metrics
                                completed=N/N  trace/s=…
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                ┃  Overall ┃ Last 30s Window ┃ Steady State ┃ Steady State / GPU ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ total prompt tok/s    │    ...   │           ...   │         ...  │              ...   │
│ cached prompt tok/s   │    ...   │           ...   │         ...  │              ...   │
│ uncached prompt tok/s │    ...   │           ...   │         ...  │              ...   │
│ completion tok/s      │    ...   │           ...   │         ...  │              ...   │
└───────────────────────┴──────────┴─────────────────┴──────────────┴────────────────────┘

Metrics

Per-trace

Latency (s) — end-to-end latency from the first request of a trace to the final response.
TTFT (s) — (streaming) time to the first streamed token of the first request.
TTFAT (s) — (streaming) time from trace start to the first streamed token of the final request. The user-visible first token in an agent that hides intermediate tool turns.
Decode TPS (tok/s) — (streaming) mean post-TTFT decode throughput across the trace's requests.
Cache hit rate (%) — server-reported cached_prompt_tokens / prompt_tokens over all requests in a trace.
Eligible cache hit rate (%) — same numerator, denominator restricted to prompt tokens expected to be cacheable. Excludes the initial prompt and, on each turn, the tool output newly appended on that request: sum_i cached_prompt_tokens_i / sum_i eligible_prompt_tokens_i, where eligible_prompt_tokens_0 = 0 and eligible_prompt_tokens_i = prompt_tokens_{i-1} + completion_tokens_{i-1} for i > 0.
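The two cache hit rates can be computed from per-request usage as sketched below. The flat dictionary keys are an assumption made for illustration (the server reports the cached count under usage.prompt_tokens_details.cached_tokens), and cache_hit_rates is a hypothetical helper rather than trie's code.

```python
def cache_hit_rates(requests):
    """requests: per-request usage dicts for one trace, in request order.

    Returns (cache_hit_rate_pct, eligible_cache_hit_rate_pct).
    """
    cached = sum(r["cached_tokens"] for r in requests)
    prompt = sum(r["prompt_tokens"] for r in requests)
    # eligible_0 = 0; eligible_i = prompt_{i-1} + completion_{i-1} for i > 0,
    # so the total is a sum over every request except the last.
    eligible = sum(
        prev["prompt_tokens"] + prev["completion_tokens"] for prev in requests[:-1]
    )
    overall_pct = 100.0 * cached / prompt if prompt else 0.0
    eligible_pct = 100.0 * cached / eligible if eligible else 0.0
    return overall_pct, eligible_pct


trace = [
    {"prompt_tokens": 32, "completion_tokens": 16, "cached_tokens": 0},
    {"prompt_tokens": 56, "completion_tokens": 20, "cached_tokens": 40},
    {"prompt_tokens": 88, "completion_tokens": 64, "cached_tokens": 76},
]
overall_pct, eligible_pct = cache_hit_rates(trace)
print(round(overall_pct, 1), round(eligible_pct, 1))  # 65.9 93.5
```

Here 116 of 176 prompt tokens were served from cache, but only 124 were eligible, so the eligible rate is much closer to 100%.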

Workload

trace/s — completed traces per wall-clock second.
total / cached / uncached prompt tok/s — aggregate prompt-token throughput, split by what the synthetic workload accounting expects to be cached vs. new.
completion tok/s — aggregate completion-token throughput.

Each is reported under four columns:

Overall — totals over the full benchmark wall time.
Last 30s Window — slope of cumulative token counts over the most recent 30 seconds.
Steady State — the headline throughput metric. Slope after dropping the first 20% of wall time as warmup. Avoids dilution from ramp-up and drain when fewer than concurrency traces are in flight. With concurrency > 1 the completion curve depends on finish order, so the metric has small run-to-run variance even at fixed seed.
Steady State / GPU — Steady State / num_gpus when num_gpus is set.
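The four columns can be sketched from a series of (time, cumulative tokens) samples. The README says "slope" without naming an estimator, so the least-squares fit here is an assumption, as are the function names and the sampling format.

```python
def slope(samples):
    """Least-squares slope of (time_s, cumulative_tokens) samples, in tok/s."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def throughput_columns(samples, num_gpus=None, warmup_fraction=0.2, window_s=30.0):
    """Compute the four reported columns from time-ordered samples."""
    start, end = samples[0][0], samples[-1][0]
    wall = end - start
    steady = slope([s for s in samples if s[0] >= start + warmup_fraction * wall])
    columns = {
        "Overall": (samples[-1][1] - samples[0][1]) / wall,
        "Last 30s Window": slope([s for s in samples if s[0] >= end - window_s]),
        "Steady State": steady,
    }
    if num_gpus:
        columns["Steady State / GPU"] = steady / num_gpus
    return columns


# Synthetic curve: 20 s of ramp-up with no output, then a steady 100 tok/s.
samples = [(float(t), float(max(0, 100 * (t - 20)))) for t in range(101)]
columns = throughput_columns(samples, num_gpus=4)
print(columns)
```

On this synthetic curve the Overall column is diluted to 80 tok/s by the idle ramp-up, while Steady State recovers the true 100 tok/s.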

Prompt-token throughputs use synthetic workload accounting; cache-hit metrics use server-reported usage. Divergence implies a tokenizer mismatch between client and server.

Known limitations

Synthetic prompts are freshly random per trace, so cross-trace prefix sharing (e.g. a common system prompt or tool definitions) is not modeled and cache hit rates may be lower than in a deployment that shares prefixes.
Decode TPS assumes the first streamed chunk carries exactly one token. Backends that buffer multiple tokens into the first chunk overstate it slightly.
transformers is unpinned; install the version whose tokenizer matches your inference server's. Mismatched versions can produce subtly different token counts and cause prompt-accounting drift.
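The Decode TPS limitation can be illustrated numerically. The exact formula is not given above, so this sketch assumes a per-request rate of (completion_tokens - 1) divided by the time between the first streamed chunk and the end of the response; decode_tps and its parameters are hypothetical.

```python
def decode_tps(completion_tokens, first_chunk_time, end_time, first_chunk_tokens=1):
    """Post-first-chunk decode rate in tok/s for a single request."""
    decode_time = end_time - first_chunk_time
    if decode_time <= 0:
        return 0.0
    # Tokens delivered in the first chunk are not counted as decode work.
    return (completion_tokens - first_chunk_tokens) / decode_time


# A 101-token response whose first chunk arrives at 1.0 s and which ends at 3.0 s.
reported = decode_tps(101, 1.0, 3.0)                     # assumes a one-token first chunk
actual = decode_tps(101, 1.0, 3.0, first_chunk_tokens=5)  # backend buffered 5 tokens
print(reported, actual)  # 50.0 48.0
```

With five tokens buffered into the first chunk, the one-token assumption reports 50 tok/s where 48 tok/s were decoded after the first chunk, an overstatement of about 4%.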

Release history

This version

0.1.1.post20260523

May 23, 2026

Download files

Source Distribution

tokenspeed_trie-0.1.1.post20260523.tar.gz (14.3 kB)

Uploaded May 23, 2026 Source

Built Distribution

tokenspeed_trie-0.1.1.post20260523-py3-none-any.whl (17.6 kB)

Uploaded May 23, 2026 Python 3

File details

Details for the file tokenspeed_trie-0.1.1.post20260523.tar.gz.

File metadata

Download URL: tokenspeed_trie-0.1.1.post20260523.tar.gz
Upload date: May 23, 2026
Size: 14.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tokenspeed_trie-0.1.1.post20260523.tar.gz
Algorithm	Hash digest
SHA256	`ae49ad877890db92aa7b33ff24cdf020c78cafe1414c6f664be55a8245c09e86`
MD5	`32faa141f34c4f26dc35a75f4bf0eb81`
BLAKE2b-256	`a18d42a68464ccf97359147b1d7f6489c563aee756c329f6570f99954a099ee4`

Provenance

The following attestation bundles were made for tokenspeed_trie-0.1.1.post20260523.tar.gz:

Publisher: release.yml on lightseekorg/tokenspeed-trie

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenspeed_trie-0.1.1.post20260523.tar.gz
- Subject digest: ae49ad877890db92aa7b33ff24cdf020c78cafe1414c6f664be55a8245c09e86
- Sigstore transparency entry: 1613555835
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: lightseekorg/tokenspeed-trie@c89fbe5e4e99e787688c7154c2a5a13131168169
- Branch / Tag: refs/heads/main
- Owner: https://github.com/lightseekorg
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c89fbe5e4e99e787688c7154c2a5a13131168169
- Trigger Event: workflow_dispatch

File details

Details for the file tokenspeed_trie-0.1.1.post20260523-py3-none-any.whl.

File metadata

Download URL: tokenspeed_trie-0.1.1.post20260523-py3-none-any.whl
Upload date: May 23, 2026
Size: 17.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tokenspeed_trie-0.1.1.post20260523-py3-none-any.whl
Algorithm	Hash digest
SHA256	`93624632b8d0afd96b9af9e9bd12b33d5af9821c75e2bf2d11f2198aeb4c425f`
MD5	`259b6615310888e70f8aca9219b00ade`
BLAKE2b-256	`dc9c898e46c501a359520d1c70d672f47ac081b05f170370879b20124a02806d`

Provenance

The following attestation bundles were made for tokenspeed_trie-0.1.1.post20260523-py3-none-any.whl:

Publisher: release.yml on lightseekorg/tokenspeed-trie

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tokenspeed_trie-0.1.1.post20260523-py3-none-any.whl
- Subject digest: 93624632b8d0afd96b9af9e9bd12b33d5af9821c75e2bf2d11f2198aeb4c425f
- Sigstore transparency entry: 1613556004
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: lightseekorg/tokenspeed-trie@c89fbe5e4e99e787688c7154c2a5a13131168169
- Branch / Tag: refs/heads/main
- Owner: https://github.com/lightseekorg
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c89fbe5e4e99e787688c7154c2a5a13131168169
- Trigger Event: workflow_dispatch
