
Benchmarking framework and datasets for temporal automata tasks

Project description

Tempo-Bench

Formally grounded LLM benchmark for temporal reasoning over automata/traces. Runs locally via a clean CLI or Python API. Keeps the wheel thin (code + tiny samples) and lets you plug in any model (OpenAI-compatible, Hugging Face, vLLM, or your own class/function).


Features

  • Tasks: trace acceptance & temporal causality (with per-feature metrics).
  • Backends: OpenRouter/OpenAI (OpenAI-compatible), HF pipelines, vLLM, or custom Python adapters.
  • Outputs: row-wise JSONL + CSV with accuracy and F1s (AP & timestep).
  • Reproducible runs: fixed seeds, manifest-friendly outputs, small packaged sample datasets.

Install

pip install tempobench

Python ≥ 3.10 recommended.


Quickstart (CLI)

# OpenRouter (OpenAI-compatible) example
export OPENROUTER_API_KEY=YOUR_KEY

tempobench run \
  --dataset_path src/tempobench/data/causal-done.jsonl \
  --task causal \
  --backend openrouter \
  --model-id openai/gpt-4o-mini \
  --gen-args '{"temperature":0.0,"max_tokens":256}' \
  --outdir benchmark_results --console-prints

Other backends: not yet implemented; this is currently tracked as an open issue. The commands below show the intended invocations.

# OpenAI
export OPENAI_API_KEY=YOUR_KEY
tempobench run --dataset_path ... --task causal --backend openai --model-id gpt-4o-mini

# Hugging Face (local model)
tempobench run --dataset_path ... --task causal \
  --backend hf --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --model-args '{"device":0}' --gen-args '{"max_new_tokens":256}'

# vLLM server (OpenAI API compatible)
tempobench run --dataset_path ... --task trace \
  --backend vllm --model-id my-vllm \
  --model-args '{"base_url":"http://127.0.0.1:8000/v1","api_key":"nokey"}'

Outputs land under benchmark_results/<task>/ as both .jsonl and .csv.
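The row-wise files can also be picked up programmatically. A minimal sketch, assuming only the `benchmark_results/<task>/` layout described above; the `list_outputs` helper name is illustrative, not part of the tempobench API:

```python
from pathlib import Path

def list_outputs(outdir="benchmark_results"):
    """Map each task directory under `outdir` to its .jsonl/.csv output files."""
    root = Path(outdir)
    return {
        d.name: sorted(f.name for f in d.iterdir() if f.suffix in (".jsonl", ".csv"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```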


Python API

You can use the Benchmark class to build custom benchmarking workflows on top of tempobench logic. See benchmark.py on the project GitHub.

from tempobench import Benchmark

bench = Benchmark(
    dataset_path="src/tempobench/data/causal-done.jsonl",
    task="causal",
    model_id="openai/gpt-4o-mini",
    results_dir="benchmark_results",
    console_prints=True,
)

df = bench.evaluate()
print(df.head())

Datasets

See my Hugging Face profile for the public tempobench benchmarking datasets.

If you are interested in access to our reasoning-SFT datasets, reach out to me.


Env vars

Set the environment variable that matches your chosen backend before running:

  • OPENROUTER_API_KEY (for --backend openrouter)
  • OPENAI_API_KEY (for --backend openai)
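A cheap way to fail early in a script is to check the matching key before launching a run. A minimal sketch; `check_backend_env` is a hypothetical helper, not part of the tempobench API:

```python
import os

# API key required per backend (backends with local models need none).
REQUIRED_KEYS = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
}

def check_backend_env(backend):
    """Raise early if the API key for the chosen backend is missing."""
    key = REQUIRED_KEYS.get(backend)
    if key and not os.environ.get(key):
        raise RuntimeError(f"--backend {backend} requires the {key} environment variable")
```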

Results schema (per row)

results_*.jsonl contains:

{
  "model": "openai/gpt-4o-mini",
  "gold": "... (gold JSON) ...",
  "pred": "... (raw text) ...",
  "GT": { "...parsed..." },
  "PRED": { "...parsed..." },
  "correct": true,
  "precision_ap": 1.0,
  "recall_ap": 1.0,
  "F1_ap": 1.0,
  "precision_timestep": 1.0,
  "recall_timestep": 1.0,
  "F1_timestep": 1.0,
  "cost": 0.0023,
  "generation_id": "gen_...",
  "native_prompt_tokens": 123,
  "native_completion_tokens": 45
}
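Given that layout, aggregate metrics can be computed directly from the file. A minimal sketch, assuming only the per-row fields shown above; `summarize_results` is an illustrative helper, not part of tempobench:

```python
import json
from pathlib import Path

def summarize_results(path):
    """Aggregate per-row metrics from a results_*.jsonl file."""
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    n = len(rows)
    return {
        "rows": n,
        "accuracy": sum(r["correct"] for r in rows) / n,
        "mean_F1_ap": sum(r["F1_ap"] for r in rows) / n,
        "mean_F1_timestep": sum(r["F1_timestep"] for r in rows) / n,
        "total_cost": sum(r.get("cost", 0.0) for r in rows),
    }
```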

License

MIT (see LICENSE).

Project details


Download files

Download the file for your platform.

Source Distribution

tempobench-0.1.3.tar.gz (13.4 kB)


Built Distribution


tempobench-0.1.3-py3-none-any.whl (16.1 kB)


File details

Details for the file tempobench-0.1.3.tar.gz.

File metadata

  • Download URL: tempobench-0.1.3.tar.gz
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for tempobench-0.1.3.tar.gz

  • SHA256: 2632e87ea594f08f1d821bdc9d7cbce5f91cbdfe661d22ada8d0d91b0f87f650
  • MD5: 31523eaecb2f7a3c98935a1d50f7e4c6
  • BLAKE2b-256: ea56c98d342e8edf2cfc482aa38d006426245120d0a524f3faec017834718642


File details

Details for the file tempobench-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: tempobench-0.1.3-py3-none-any.whl
  • Size: 16.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for tempobench-0.1.3-py3-none-any.whl

  • SHA256: 719311a6b3a4697d634828b934831901c5f28525a328837dbb1b7b9fbd2a35c8
  • MD5: da47ffe4431c4a821b1d0f3f605662c7
  • BLAKE2b-256: 2f90dc80bcc5ab7ebf8285bfc8c7fca5c1c182cb8e1693d173beaf009aa3dbbe

