# if-verifiable

Lightweight Python library for evaluating LLM outputs against the IFEval and IFBench instruction-following benchmarks.
Supports:

- IFEval (`google/IFEval`): Google's Instruction-Following Eval
- IFBench (`allenai/IFBench_test`): Allen AI's instruction-following benchmark
## Installation

```
pip install if-verifiable
```
## Usage

```python
from if_verifiable import get_eval_data, evaluate_output_for_sample

# Load samples from a benchmark
for sample in get_eval_data("ifeval"):
    print(f"Prompt: {sample.prompt[:100]}...")
    print(f"Instructions: {sample.instruction_id_list}")
    break

# Evaluate a model's response
sample = next(get_eval_data("ifeval"))
response = "Your model's response here..."
results, scores = evaluate_output_for_sample("ifeval", sample, response)

# Access scores (4 metrics available)
print(f"Partial strict: {scores.partial_strict:.2%}")
print(f"Partial loose: {scores.partial_loose:.2%}")
print(f"Binary strict (all passed): {scores.binary_strict}")
print(f"Binary loose (all passed): {scores.binary_loose}")

# Check individual instruction results
for result in results:
    print(f"  {result.instruction_id}: strict={result.strict_pass}, loose={result.loose_pass}")
```
## API

### `get_eval_data(benchmark: str) -> Iterator[BenchmarkSample]`

Load evaluation samples from a benchmark dataset.

- `benchmark`: Either `"ifeval"` or `"ifbench"`
- Returns: an iterator of `IFEvalSample` or `IFBenchSample` dataclasses

### `evaluate_output_for_sample(benchmark, sample, response) -> tuple[list[InstructionResult], EvaluationScores]`

Evaluate a model response against a benchmark sample.

- `benchmark`: Either `"ifeval"` or `"ifbench"`
- `sample`: A sample from `get_eval_data()`
- `response`: The model's text response

Returns:

- `list[InstructionResult]`: Per-instruction pass/fail results
- `EvaluationScores`: Aggregated scores dataclass with 4 metrics:
  - `partial_strict`: Fraction of instructions passed (strict evaluation)
  - `partial_loose`: Fraction of instructions passed (loose; allows formatting variations)
  - `binary_strict`: 1.0 if ALL instructions passed strict, else 0.0
  - `binary_loose`: 1.0 if ALL instructions passed loose, else 0.0
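The relationship between the partial and binary metrics can be illustrated with a toy aggregation. This is only a sketch, not the library's implementation: `InstructionResult` is redefined locally so the snippet runs standalone, and the instruction ids and `aggregate` helper are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class InstructionResult:
    instruction_id: str
    strict_pass: bool
    loose_pass: bool

def aggregate(results: list[InstructionResult]) -> tuple[float, float, float, float]:
    # partial_*: fraction of instructions passed; binary_*: 1.0 only if ALL passed
    n = len(results)
    partial_strict = sum(r.strict_pass for r in results) / n
    partial_loose = sum(r.loose_pass for r in results) / n
    return (partial_strict, partial_loose,
            float(partial_strict == 1.0), float(partial_loose == 1.0))

results = [
    InstructionResult("keywords:existence", True, True),
    InstructionResult("length_constraints:number_words", False, True),
]
print(aggregate(results))  # (0.5, 1.0, 0.0, 1.0)
```

Note how a single strict failure zeroes out `binary_strict` even though half the instructions passed, which is why the partial metrics are useful for tracking incremental progress.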
## Types

```python
@dataclass
class IFEvalSample:
    key: int
    prompt: str
    instruction_id_list: list[str]
    kwargs: list[dict[str, Any]]

@dataclass
class IFBenchSample:
    key: str
    prompt: str
    instruction_id_list: list[str]
    kwargs: list[dict[str, Any]]

@dataclass
class EvaluationScores:
    partial_strict: float  # Fraction of instructions passed (strict)
    partial_loose: float   # Fraction of instructions passed (loose)
    binary_strict: float   # 1.0 if all passed strict, else 0.0
    binary_loose: float    # 1.0 if all passed loose, else 0.0

@dataclass
class InstructionResult:
    instruction_id: str
    strict_pass: bool
    loose_pass: bool
```
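When running over a full benchmark, a common way to report results is to average each per-sample metric across the dataset. A minimal sketch of that bookkeeping, with `EvaluationScores` redefined locally and a hypothetical `mean_scores` helper (not part of this library's API):

```python
from dataclasses import dataclass, fields

@dataclass
class EvaluationScores:
    partial_strict: float
    partial_loose: float
    binary_strict: float
    binary_loose: float

def mean_scores(per_sample: list[EvaluationScores]) -> EvaluationScores:
    # Average each of the four metrics independently across samples
    n = len(per_sample)
    return EvaluationScores(*(
        sum(getattr(s, f.name) for s in per_sample) / n
        for f in fields(EvaluationScores)
    ))

per_sample = [
    EvaluationScores(1.0, 1.0, 1.0, 1.0),
    EvaluationScores(0.5, 1.0, 0.0, 1.0),
]
print(mean_scores(per_sample))  # mean of each metric over the two samples
```

Averaged `binary_strict` is the headline "all instructions followed" accuracy typically reported for IFEval-style benchmarks.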
## License

Apache 2.0
## File details

### if_verifiable-0.1.0.tar.gz

- Upload date:
- Size: 209.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | e0a33441fdd4f940c340bc4bab5337570cf1a5315252e201beaded0471883dea |
| MD5 | 0fb04f7e9f93ea2007e068e14b496eb0 |
| BLAKE2b-256 | d0a7158b241ee3d69081237fc08dc68e0c366f1752397b88feb197428770ec7d |
### if_verifiable-0.1.0-py3-none-any.whl

- Upload date:
- Size: 51.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | ea63a57b25c9a5ed931f775f87a25bebe8d6dec8c69fecbc0bed849ac1bc6d4d |
| MD5 | 01566b0ad31dcfe17aee182b27a7042e |
| BLAKE2b-256 | 303ddc1e34ad59fb2beec8138c099a328467cf58bd89814f8571bbcc37b75e68 |