# if-verifiable

Lightweight Python library for evaluating LLM outputs against the IFEval and IFBench instruction-following benchmarks.
Supports:

- **IFEval** (`google/IFEval`) - Google's Instruction Following Eval
- **IFBench** (`allenai/IFBench_test`) - Allen AI's instruction-following benchmark
## Installation

```bash
pip install if-verifiable
```
## Usage

```python
from if_verifiable import get_eval_data, evaluate_output_for_sample

# Load samples from a benchmark
for sample in get_eval_data("ifeval"):
    print(f"Prompt: {sample.prompt[:100]}...")
    print(f"Instructions: {sample.instruction_id_list}")
    break

# Evaluate a model's response
sample = next(get_eval_data("ifeval"))
response = "Your model's response here..."
results, scores = evaluate_output_for_sample("ifeval", sample, response)

# Access scores (4 metrics available)
print(f"Partial strict: {scores.partial_strict:.2%}")
print(f"Partial loose: {scores.partial_loose:.2%}")
print(f"Binary strict (all passed): {scores.binary_strict}")
print(f"Binary loose (all passed): {scores.binary_loose}")

# Check individual instruction results
for result in results:
    print(f"  {result.instruction_id}: strict={result.strict_pass}, loose={result.loose_pass}")
```
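The loose metrics tolerate common formatting noise in model output. IFEval-style loose evaluation generally re-runs each check on lightly normalized variants of the response; the sketch below illustrates that idea under stated assumptions (`loose_variants` and `loose_pass` are hypothetical helpers, not this library's API, and the exact normalizations the library applies may differ):

```python
def loose_variants(response: str) -> list[str]:
    # IFEval-style loose evaluation re-checks lightly normalized variants:
    # the raw text, the text without markdown asterisks, and the text with
    # the first or last line stripped (e.g. a "Sure, here is..." preamble).
    lines = response.split("\n")
    return [
        response,
        response.replace("*", ""),
        "\n".join(lines[1:]).strip(),
        "\n".join(lines[:-1]).strip(),
    ]

def loose_pass(check, response: str) -> bool:
    # A response passes loosely if ANY variant satisfies the check.
    return any(check(v) for v in loose_variants(response))

# A checker requiring the response to start with "hello" fails strictly
# on "Sure!\nhello world", but passes loosely once the preamble is dropped.
print(loose_pass(lambda t: t.startswith("hello"), "Sure!\nhello world"))
```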
### Batch Evaluation

```python
from if_verifiable import run_eval

# Sync batch evaluation with multiprocessing
model_responses = ["response1", "response2", ...]  # One per sample
results = run_eval("ifeval", model_responses, max_workers=8)

for sample, response, instruction_results, scores in results:
    print(f"{sample.key}: {scores.partial_strict:.2%}")
```
### Async Evaluation

```python
import asyncio
from if_verifiable import run_eval_async, get_eval_data

async def get_model_response(prompt: str) -> dict:
    # Your async API call here
    return {"content": "model response", "usage": {...}}

samples = list(get_eval_data("ifeval"))
coroutines = [get_model_response(s.prompt) for s in samples]

# Evaluate concurrently with a map function to extract the response string
results = await run_eval_async(
    "ifeval",
    coroutines,
    map_fn=lambda r: r["content"],
)
```
## API

### `get_eval_data(benchmark: str) -> Iterator[BenchmarkSample]`

Load evaluation samples from a benchmark dataset.

- `benchmark`: Either `"ifeval"` or `"ifbench"`
- Returns: Iterator of `IFEvalSample` or `IFBenchSample` dataclasses
### `evaluate_output_for_sample(benchmark, sample, response) -> tuple[list[InstructionResult], EvaluationScores]`

Evaluate a model response against a benchmark sample.

- `benchmark`: Either `"ifeval"` or `"ifbench"`
- `sample`: A sample from `get_eval_data()`
- `response`: The model's text response

Returns:

- `list[InstructionResult]`: Per-instruction pass/fail results
- `EvaluationScores`: Aggregated scores dataclass with 4 metrics:
  - `partial_strict`: Fraction of instructions passed (strict evaluation)
  - `partial_loose`: Fraction of instructions passed (loose - allows formatting variations)
  - `binary_strict`: 1.0 if ALL instructions passed strict, else 0.0
  - `binary_loose`: 1.0 if ALL instructions passed loose, else 0.0
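The four metrics relate simply: the partial scores are pass fractions, and the binary scores collapse them to all-or-nothing. A self-contained sketch of that aggregation, mirroring (not calling) the library's dataclasses; the instruction IDs are illustrative:

```python
from dataclasses import dataclass

@dataclass
class InstructionResult:  # local stand-in for the library's dataclass
    instruction_id: str
    strict_pass: bool
    loose_pass: bool

def aggregate(results: list[InstructionResult]) -> dict[str, float]:
    """Compute the four EvaluationScores metrics from per-instruction results."""
    n = len(results)
    partial_strict = sum(r.strict_pass for r in results) / n
    partial_loose = sum(r.loose_pass for r in results) / n
    return {
        "partial_strict": partial_strict,
        "partial_loose": partial_loose,
        # Binary metrics are 1.0 only when every instruction passed.
        "binary_strict": 1.0 if partial_strict == 1.0 else 0.0,
        "binary_loose": 1.0 if partial_loose == 1.0 else 0.0,
    }

scores = aggregate([
    InstructionResult("punctuation:no_comma", strict_pass=True, loose_pass=True),
    InstructionResult("length_constraints:number_words", strict_pass=False, loose_pass=True),
])
print(scores)
# partial_strict=0.5, partial_loose=1.0, binary_strict=0.0, binary_loose=1.0
```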
### `run_eval(benchmark, model_responses, max_workers=None) -> list[EvalResult]`

Batch evaluate all responses with multiprocessing.

- `benchmark`: Either `"ifeval"` or `"ifbench"`
- `model_responses`: List of response strings, one per sample in the dataset
- `max_workers`: Number of parallel workers (`None` = auto)

Returns a list of `(sample, response, instruction_results, scores)` tuples.
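Per-sample scores from `run_eval` are typically averaged into benchmark-level numbers. A minimal sketch of that aggregation, using stand-in tuples rather than the library's real `EvalResult` (the `Scores` dataclass here just mirrors `EvaluationScores`):

```python
from dataclasses import dataclass

@dataclass
class Scores:  # stand-in for EvaluationScores
    partial_strict: float
    partial_loose: float
    binary_strict: float
    binary_loose: float

def summarize(results) -> dict[str, float]:
    """Average each metric over (sample, response, instr_results, scores) tuples."""
    n = len(results)
    return {
        "partial_strict": sum(s.partial_strict for *_, s in results) / n,
        "binary_strict": sum(s.binary_strict for *_, s in results) / n,
    }

# Two fake results: one fully passing sample, one half passing.
fake = [
    (None, "resp1", [], Scores(1.0, 1.0, 1.0, 1.0)),
    (None, "resp2", [], Scores(0.5, 0.5, 0.0, 0.0)),
]
print(summarize(fake))  # {'partial_strict': 0.75, 'binary_strict': 0.5}
```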
### `run_eval_async(benchmark, coroutines, map_fn=str) -> list[EvalResult]`

Evaluate responses from async coroutines concurrently.

- `benchmark`: Either `"ifeval"` or `"ifbench"`
- `coroutines`: List of awaitables, one per sample
- `map_fn`: Function to extract the response string from each coroutine result

Returns a list of `(sample, response, instruction_results, scores)` tuples in input order.
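The in-order guarantee matches `asyncio.gather` semantics: results come back in input order even when later coroutines finish first. A minimal sketch of that pattern (this is not the library's implementation; `gather_responses` and `fake_model` are hypothetical names):

```python
import asyncio

async def gather_responses(coroutines, map_fn=str):
    # asyncio.gather preserves input order, so extracted responses
    # line up one-to-one with the benchmark samples they came from.
    raw = await asyncio.gather(*coroutines)
    return [map_fn(r) for r in raw]

async def fake_model(i: int) -> dict:
    await asyncio.sleep(0.01 * (3 - i))  # later coroutines finish first
    return {"content": f"response {i}"}

texts = asyncio.run(gather_responses(
    [fake_model(i) for i in range(3)],
    map_fn=lambda r: r["content"],
))
print(texts)  # ['response 0', 'response 1', 'response 2'] despite reversed completion
```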
## Types

```python
@dataclass
class IFEvalSample:
    key: int
    prompt: str
    instruction_id_list: list[str]
    kwargs: list[dict[str, Any]]

@dataclass
class IFBenchSample:
    key: str
    prompt: str
    instruction_id_list: list[str]
    kwargs: list[dict[str, Any]]

@dataclass
class EvaluationScores:
    partial_strict: float  # Fraction of instructions passed (strict)
    partial_loose: float   # Fraction of instructions passed (loose)
    binary_strict: float   # 1.0 if all passed strict, else 0.0
    binary_loose: float    # 1.0 if all passed loose, else 0.0

@dataclass
class InstructionResult:
    instruction_id: str
    strict_pass: bool
    loose_pass: bool

# Type alias for batch evaluation results
EvalResult = tuple[BenchmarkSample, str, list[InstructionResult], EvaluationScores]
```
## License

Apache 2.0
## File details

### `if_verifiable-0.1.2.tar.gz`

- Size: 160.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `32f28e6dc6a28f9589e9d1cbe4ad0e1248fc5134276816797a7e97c7b0e27cbc` |
| MD5 | `53be94eb7efc793858ff68e0d2d8374f` |
| BLAKE2b-256 | `75185ad4f861acbe59dfd5aa48149aee117f54e25f5a58bf7f3cf85796ca796a` |
### `if_verifiable-0.1.2-py3-none-any.whl`

- Size: 52.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `911610fc69d9618fadf140b162ef0b4cb8798c31985874db5c41c1a5e3bcad76` |
| MD5 | `1cc9de8a3f5f7ce38288efc8b0ac0070` |
| BLAKE2b-256 | `006aeff23144bb61a6a10e27993fbbe394942b9c529db6f0b3d4dfcf36879c71` |