
Benchmarking framework and datasets for temporal automata tasks

Project description

Tempo-Bench

Formally grounded LLM benchmark for temporal reasoning over automata/traces. Runs locally via a clean CLI or Python API. Keeps the wheel thin (code + tiny samples) and lets you plug in any model (OpenAI-compatible, Hugging Face, vLLM, or your own class/function).


Features

  • Tasks: trace acceptance & temporal causality (with per-feature metrics).
  • Backends: OpenRouter/OpenAI (OpenAI-compatible), HF pipelines, vLLM, or custom Python adapters.
  • Outputs: row-wise JSONL + CSV with accuracy and F1s (AP & timestep).
  • Reproducible runs: fixed seeds, manifest-friendly outputs, small packaged sample datasets.

Install

pip install tempobench

Python ≥ 3.10 recommended.


Quickstart (CLI)

# OpenRouter (OpenAI-compatible) example
export OPENROUTER_API_KEY=YOUR_KEY

tempobench run \
  --dataset_path src/tempobench/data/causal-done.jsonl \
  --task causal \
  --backend openrouter \
  --model-id openai/gpt-4o-mini \
  --gen-args '{"temperature":0.0,"max_tokens":256}' \
  --outdir benchmark_results --console-prints

Other backends: not yet implemented; this is currently tracked as an open issue. The commands below show the intended invocations.

# OpenAI
export OPENAI_API_KEY=YOUR_KEY
tempobench run --dataset_path ... --task causal --backend openai --model-id gpt-4o-mini

# Hugging Face (local model)
tempobench run --dataset_path ... --task causal \
  --backend hf --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --model-args '{"device":0}' --gen-args '{"max_new_tokens":256}'

# vLLM server (OpenAI API compatible)
tempobench run --dataset_path ... --task trace \
  --backend vllm --model-id my-vllm \
  --model-args '{"base_url":"http://127.0.0.1:8000/v1","api_key":"nokey"}'

Outputs land under benchmark_results/<task>/ as both .jsonl and .csv.
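The row-wise files can also be picked up programmatically. A minimal sketch, assuming only the `benchmark_results/<task>/` layout described above; the `list_outputs` helper name is illustrative, not part of the tempobench API:

```python
from pathlib import Path

def list_outputs(outdir="benchmark_results"):
    """Map each task directory under `outdir` to its .jsonl/.csv output files."""
    root = Path(outdir)
    return {
        d.name: sorted(f.name for f in d.iterdir() if f.suffix in (".jsonl", ".csv"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```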


Python API

You can use the Benchmark class to build custom benchmarking workflows on top of tempobench logic. See benchmark.py on the project GitHub.

from tempobench import Benchmark

bench = Benchmark(
    dataset_path="src/tempobench/data/causal-done.jsonl",
    task="causal",
    model_id="openai/gpt-4o-mini",
    results_dir="benchmark_results",
    console_prints=True,
)

df = bench.evaluate()
print(df.head())

Datasets

See my Hugging Face profile for the public tempobench benchmarking datasets.

If you are interested in access to our reasoning-SFT datasets, reach out to me.


Env vars

Set the environment variable that matches your chosen backend before running:

  • OPENROUTER_API_KEY (for --backend openrouter)
  • OPENAI_API_KEY (for --backend openai)
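A cheap way to fail early in a script is to check the matching key before launching a run. A minimal sketch; `check_backend_env` is a hypothetical helper, not part of the tempobench API:

```python
import os

# API key required per backend (backends with local models need none).
REQUIRED_KEYS = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
}

def check_backend_env(backend):
    """Raise early if the API key for the chosen backend is missing."""
    key = REQUIRED_KEYS.get(backend)
    if key and not os.environ.get(key):
        raise RuntimeError(f"--backend {backend} requires the {key} environment variable")
```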

Results schema (per row)

results_*.jsonl contains:

{
  "model": "openai/gpt-4o-mini",
  "gold": "... (gold JSON) ...",
  "pred": "... (raw text) ...",
  "GT": { "...parsed..." },
  "PRED": { "...parsed..." },
  "correct": true,
  "precision_ap": 1.0,
  "recall_ap": 1.0,
  "F1_ap": 1.0,
  "precision_timestep": 1.0,
  "recall_timestep": 1.0,
  "F1_timestep": 1.0,
  "cost": 0.0023,
  "generation_id": "gen_...",
  "native_prompt_tokens": 123,
  "native_completion_tokens": 45
}
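Given that layout, aggregate metrics can be computed directly from the file. A minimal sketch, assuming only the per-row fields shown above; `summarize_results` is an illustrative helper, not part of tempobench:

```python
import json
from pathlib import Path

def summarize_results(path):
    """Aggregate per-row metrics from a results_*.jsonl file."""
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    n = len(rows)
    return {
        "rows": n,
        "accuracy": sum(r["correct"] for r in rows) / n,
        "mean_F1_ap": sum(r["F1_ap"] for r in rows) / n,
        "mean_F1_timestep": sum(r["F1_timestep"] for r in rows) / n,
        "total_cost": sum(r.get("cost", 0.0) for r in rows),
    }
```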

License

MIT (see LICENSE).

Project details


Download files

Download the file for your platform.

Source Distribution

tempobench-0.1.3.tar.gz (13.4 kB)


Built Distribution


tempobench-0.1.3-py3-none-any.whl (16.1 kB)


File details

Details for the file tempobench-0.1.3.tar.gz.

File metadata

  • Download URL: tempobench-0.1.3.tar.gz
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for tempobench-0.1.3.tar.gz

  • SHA256: 2632e87ea594f08f1d821bdc9d7cbce5f91cbdfe661d22ada8d0d91b0f87f650
  • MD5: 31523eaecb2f7a3c98935a1d50f7e4c6
  • BLAKE2b-256: ea56c98d342e8edf2cfc482aa38d006426245120d0a524f3faec017834718642


File details

Details for the file tempobench-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: tempobench-0.1.3-py3-none-any.whl
  • Size: 16.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for tempobench-0.1.3-py3-none-any.whl

  • SHA256: 719311a6b3a4697d634828b934831901c5f28525a328837dbb1b7b9fbd2a35c8
  • MD5: da47ffe4431c4a821b1d0f3f605662c7
  • BLAKE2b-256: 2f90dc80bcc5ab7ebf8285bfc8c7fca5c1c182cb8e1693d173beaf009aa3dbbe

