Lightweight evaluation framework that unifies inference through a single VLLM sampler and runs IF-EVAL, IFBench, WritingBench, HealthBench, Arena-Hard, and AlpacaEval end-to-end.

These details have not been verified by PyPI

Project description

eval_framework

Lightweight evaluation framework for academic LLM evaluation. It provides a single vLLM-compatible inference path plus task runners for IF-EVAL, IFBench, WritingBench, HealthBench, Arena-Hard, and AlpacaEval.

Installation

cd eval_framework
uv venv && source .venv/bin/activate
uv pip install -e .
uv pip install vllm --torch-backend=auto
git clone https://github.com/allenai/IFBench .external/IFBench

After release, users can install the package from PyPI:

pip install llm-eval-framework

Arena-Hard v2.0 questions/baselines and AlpacaEval GPT-4 baseline references are bundled under tasks/arena_hard/data/ and tasks/alpaca_eval/data/. IFBench still requires the AllenAI verifier source; clone it to .external/IFBench or pass --ifbench-dir.

After installation the eval-framework command is available globally in the venv.

Quick Start

eval-framework \
  --tasks ifeval \
  --model Qwen3-4B \
  --base-url http://localhost:8000/v1 \
  --output-dir outputs/qwen3-4b

For multi-GPU checkpoint sweeps, copy one of the example scripts and override paths through environment variables:

CKPT_DIR=/path/to/checkpoints \
OUT_DIR=outputs/my_run \
GPU_IDS="0 1 2 3" \
STEPS="120,240,360" \
SKIP_COMPLETE=1 \
bash examples/batch_eval.sh

Tasks

Task	Judge needed?	Key flags
`ifeval`	No (rule-based)	`--ifeval-input`
`ifbench`	No (rule-based)	`--ifbench-dir`, `--ifbench-input`
`writingbench`	Yes	`--writingbench-query`, `--writingbench-write-excel`
`healthbench`	Yes	`--healthbench-data`
`arena-hard`	Yes	`--arena-hard-dir`, `--arena-hard-benchmark`
`alpaca-eval`	Yes	`--alpaca-eval-reference`, `--alpaca-eval-hf-dataset`

Modes

--inference-only — generate responses, skip judging. Judge later with --judge-only.
--judge-only — score existing responses. Only supports writingbench / healthbench / arena-hard / alpaca-eval (ifeval and ifbench are rule-based and score during inference).

Multi-GPU Batch Evaluation

For RL experiments you typically need to evaluate many checkpoints across all benchmarks. We provide ready-to-use scripts in examples/:

Script	Use case
`examples/shard_parallel_eval.sh`	Evaluate ONE model on all benchmarks — shards data across N GPUs for max throughput
`examples/batch_eval.sh`	Evaluate one training run — auto-detects checkpoints, schedules across N GPUs in rounds, judges, plots

Usage:

# 1. Copy and edit the CONFIG section at the top of the script
cp examples/batch_eval.sh my_eval.sh
vim my_eval.sh   # edit CKPT_DIR, OUT_DIR, STEPS, etc.

# 2. Run
bash my_eval.sh

What the scripts handle automatically:

Multi-round scheduling — if you have more checkpoints than GPUs, the script runs them in rounds and cleans up vLLM between rounds
vLLM lifecycle — starts servers, waits for health checks, kills process groups after eval
Judge batching — runs judge jobs in small batches to respect API rate limits (configurable JUDGE_BATCH_SIZE)
Phase control — set RUN_INFERENCE=0 / RUN_JUDGE=0 / RUN_PLOT=0 to skip phases (e.g. re-run judge only after fixing an issue)
Logging — all vLLM and eval logs go to LOG_DIR for debugging; judge stderr (tqdm) is tee'd to terminal

vLLM Tips

Do NOT set --max-model-len unless you know exactly what you're doing. Let the model use its native context length (e.g. 32768 for Qwen3-4B). Setting it too low causes VLLMValidationError on long prompts.
--gpu-memory-utilization 0.95 is safe for H100s and maximizes KV cache.
Increase --num-threads when GPU utilization is low and the serving backend has available capacity.
Kill process groups, not just PIDs — kill -- -${pid} ensures all vLLM child processes are cleaned up. Follow with pkill -f "vllm serve" between rounds.

Output structure

outputs/
├── step_120/
│   ├── run_0/                     # one subdir per sample (mean@N evaluation)
│   │   ├── ifeval/       # summary.json, responses.jsonl
│   │   ├── ifbench/      # summary.json, responses.jsonl, eval_results_*.jsonl
│   │   ├── writingbench/ # responses.jsonl, scores.jsonl, summary.json
│   │   ├── healthbench/  # responses.jsonl, scores.jsonl, summary.json
│   │   ├── arena-hard/   # model_answer/, model_judgment/, summary.json
│   │   └── alpaca-eval/  # model_answer/, model_judgment/, summary.json
│   ├── run_1/ ...                 # up to run_{N-1}
│   ├── ifeval/summary_agg.json    # aggregated mean / std / sem / per_run
│   ├── healthbench/summary_agg.json
│   └── ...
├── step_240/
│   └── ...
└── plots/
    ├── ifeval.png
    ├── ifbench.png
    ├── healthbench.png
    ├── writingbench.png
    ├── arena-hard.png
    ├── alpaca-eval.png
    └── all_tasks.png

run_k/ holds the k-th sample's raw artifacts; summary_agg.json at the step root is what plotting consumes. With N=1 everything still works but error bars collapse to zero width.

Sampling variance (mean@N + error bars)

batch_eval.sh runs each checkpoint N times per task and then aggregates. Because the same live vLLM server handles all N samples, prefix caching amortises prefill — wall time is roughly decode(N)×, not N× cold starts.

Per-task defaults (override with env vars):

Task	Default N	Why
`ifeval` / `ifbench`	8	Rule-based scoring, cost is only GPU decode
`healthbench`	8	Rubric-based, judge cost 8× but gives honest error bars
`writingbench`	4	Large rubric per prompt; 4 samples is usually enough
`arena-hard` / `alpaca-eval`	1	These already report internal bootstrap CI; extra sampling rarely helps

Override any of them:

N_SAMPLES_HEALTHBENCH=4 N_SAMPLES_WRITINGBENCH=1 bash examples/batch_eval.sh

Set them all to 1 to reproduce the original single-run behavior.

Plotting

适用于跑完 inference+judge+aggregate 之后，想任意组合 ckpt eval 结果进行绘图。带了 summary_agg.json 会自动画 error bar；没有就退回普通折线。

python tools/plot_training_curves.py \
  --runs "run_a=outputs/run_a" \
  --runs "run_b=outputs/run_b" \
  --name-pattern "run_a=step_{step}" \
  --name-pattern "run_b=step_{step}" \
  --steps "120,240,360,480,600" \
  --tasks "ifeval,ifbench,healthbench,writingbench,arena-hard,alpaca-eval" \
  --plot-dir outputs/plots \
  --show-errorbar ci95          # ci95 (1.96·SEM) | sem | std | none

batch_eval.sh runs aggregate_runs.py during its plotting phase. To aggregate manually:

python tools/aggregate_runs.py \
  --out-dir outputs/run_a \
  --steps   120,240,360,480,600 \
  --tasks   ifeval,ifbench,healthbench,writingbench,arena-hard,alpaca-eval \
  --n-samples ifeval=8,ifbench=8,healthbench=8,writingbench=4,arena-hard=1,alpaca-eval=1

Judge comparison

Compare scores from different judge models:

python tools/judge_compare.py \
  --judges flash=outputs/qwen3-4B \
  --judges plus=outputs/qwen3-4B-judge-qwen-plus \
  --out outputs/judge_compare.json

Global request throttle

When running many judge jobs in parallel (e.g. 5 background eval-framework processes), all remote API requests share a file-lock-based global throttle to prevent 429 rate-limit errors.

Env var	Default	Description
`MIN_INTERVAL_S`	`0.005` (≈200 QPS)	Minimum interval between consecutive API requests across all threads/processes
`EVAL_THROTTLE_STATE_PATH`	`/tmp/eval_framework_global_throttle.state`	Shared state file path; processes using the same path share one throttle

export MIN_INTERVAL_S=0.01          # ~100 QPS global cap
export EVAL_THROTTLE_STATE_PATH=/tmp/eval_framework_global_throttle.state

Set MIN_INTERVAL_S=0 to disable throttling entirely.

Notes

--output-dir controls where responses/scores/summaries go. With --tasks, output is written to <output-dir>/<task>/.
If you set --served-model-name in vllm serve, pass that same name via --model.
IFBench test data is bundled at tasks/ifbench/data/IFBench_test.jsonl. The AllenAI verifier source resolves from .external/IFBench unless you pass --ifbench-dir.
Arena-Hard questions and baselines (o3-mini-2025-01-31, gemini-2.0-flash-001 for v2.0) are bundled at tasks/arena_hard/data/. Falls back to .external/arena-hard-auto if present. Override with --arena-hard-dir to use a custom repo (e.g. a newer bench version).
AlpacaEval reference outputs auto-download from HuggingFace. Override with --alpaca-eval-reference.
IFBench also needs emoji + syllapy installed (included in pyproject.toml deps).
setuptools<81 is pinned because syllapy depends on pkg_resources which was removed in setuptools 82.

License And Third-Party Assets

The framework code is released under Apache-2.0. Bundled benchmark assets remain under their original upstream licenses and citation requirements. Before redistributing modified benchmark data, check the upstream projects for the current license and attribution terms.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

May 31, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_eval_framework-0.1.0.tar.gz (20.3 MB view details)

Uploaded May 31, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llm_eval_framework-0.1.0-py3-none-any.whl (20.4 MB view details)

Uploaded May 31, 2026 Python 3

File details

Details for the file llm_eval_framework-0.1.0.tar.gz.

File metadata

Download URL: llm_eval_framework-0.1.0.tar.gz
Upload date: May 31, 2026
Size: 20.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for llm_eval_framework-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`09d29f10a53114540ef9a15c013d65e9daf938042e3b9f58ce19adecab69d017`
MD5	`f374d534317e61d5133dc55e4480dc96`
BLAKE2b-256	`f42366ecd28191bcce6f03e48dd9812a6444ce9d91edc3c076f17a7e8cbbfa49`

See more details on using hashes here.

File details

Details for the file llm_eval_framework-0.1.0-py3-none-any.whl.

File metadata

Download URL: llm_eval_framework-0.1.0-py3-none-any.whl
Upload date: May 31, 2026
Size: 20.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for llm_eval_framework-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8f7d6a810ca9ce07dd2ab43c62fe05ab55de063951003bc4d84a478698a8d470`
MD5	`78c06d6539715a1b8926c18ac90e7d8c`
BLAKE2b-256	`11dbf983032338d0f85a3dc42f2af6346e79e1a9ba6436b056592030a7b1c4be`

See more details on using hashes here.

llm-eval-framework 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

eval_framework

Installation

Quick Start

Tasks

Modes

Multi-GPU Batch Evaluation

vLLM Tips

Output structure

Sampling variance (mean@N + error bars)

Plotting

Judge comparison

Global request throttle

Notes

License And Third-Party Assets

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes