Benchmarking framework and datasets for temporal automata tasks
Project description
Tempo-Bench
Formally grounded LLM benchmark for temporal reasoning over automata/traces. Runs locally via a clean CLI or Python API. Keeps the wheel thin (code + tiny samples) and lets you plug in any model (OpenAI-compatible, Hugging Face, vLLM, or your own class/function).
Features
- Tasks: trace acceptance & temporal causality (with per-feature metrics).
- Backends: OpenRouter/OpenAI (OpenAI-compatible), HF pipelines, vLLM, or custom Python adapters.
- Outputs: row-wise JSONL + CSV with accuracy and F1s (AP & timestep).
- Reproducible runs: fixed seeds, manifest-friendly outputs, small packaged sample datasets.
Install
pip install tempobench
Python ≥ 3.10 recommended.
Quickstart (CLI)
# OpenRouter (OpenAI-compatible) example
export OPENROUTER_API_KEY=YOUR_KEY
tempobench run \
--dataset_path src/tempobench/data/causal-done.jsonl \
--task causal \
--backend openrouter \
--model-id openai/gpt-4o-mini \
--gen-args '{"temperature":0.0,"max_tokens":256}' \
--outdir benchmark_results --console-prints
Other backends: This is not implemented yet and is an open issue currently.
# OpenAI
export OPENAI_API_KEY=YOUR_KEY
tempobench run --dataset_path ... --task causal --backend openai --model-id gpt-4o-mini
# Hugging Face (local model)
tempobench run --dataset_path ... --task causal \
--backend hf --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
--model-args '{"device":0}' --gen-args '{"max_new_tokens":256}'
# vLLM server (OpenAI API compatible)
tempobench run --dataset_path ... --task trace \
--backend vllm --model-id my-vllm \
--model-args '{"base_url":"http://127.0.0.1:8000/v1","api_key":"nokey"}'
Outputs land under benchmark_results/<task>/ as both .jsonl and .csv.
Python API
You can use the benchmarker to build custom benchmarking workflows that use tempobench logic. Check out the benchmark.py file out on the project github.
from tempobench import Benchmark
bench = Benchmark(
dataset_path="src/tempobench/data/causal-done.jsonl",
task="causal",
model_id="openai/gpt-4o-mini",
results_dir="benchmark_results",
console_prints=True,
)
df = bench.evaluate()
print(df.head())
Datasets
Check my huggingface for the tempobench public benchmarking datasets.
If you are interested in access to our datasets for reasoning SFT, reach out to me.
Env vars
You will need to have these env vars set for this to work properly.
OPENROUTER_API_KEY(for--backend openrouter)OPENAI_API_KEY(for--backend openai)
Results schema (per row)
results_*.jsonl contains:
{
"model": "openai/gpt-4o-mini",
"gold": "... (gold JSON) ...",
"pred": "... (raw text) ...",
"GT": { "...parsed..." },
"PRED": { "...parsed..." },
"correct": true,
"precision_ap": 1.0,
"recall_ap": 1.0,
"F1_ap": 1.0,
"precision_timestep": 1.0,
"recall_timestep": 1.0,
"F1_timestep": 1.0,
"cost": 0.0023,
"generation_id": "gen_...",
"native_prompt_tokens": 123,
"native_completion_tokens": 45
}
License
MIT (see LICENSE).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tempobench-0.1.0.tar.gz.
File metadata
- Download URL: tempobench-0.1.0.tar.gz
- Upload date:
- Size: 13.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3775067e3d02c4976d81208b91425e44b50527a44535e4d7f9d6da8b22d500f2
|
|
| MD5 |
dee10ad9875554facf4a1403ce9aaaa6
|
|
| BLAKE2b-256 |
e1014183a5eada860252bfdf08a09383ac75315df04d814a8159b43424424634
|
File details
Details for the file tempobench-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tempobench-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
21e0fe9db8ebfe9c70594133c51170048c211d9e3a37bb5fd80171ffe3566dcd
|
|
| MD5 |
54ab3c7e5f52ae86c05e1dcfddff946f
|
|
| BLAKE2b-256 |
ac0eb37a42b96b2a4b4ca99f8e1c7ad0c41fd737074bd0e3e4d8868c63230752
|