Lightweight evaluation library for IFBench and IFEval instruction-following benchmarks

if-verifiable

Lightweight Python library for evaluating LLM outputs against instruction-following benchmarks.

Supports:

  • IFEval (google/IFEval) - Google's Instruction Following Eval
  • IFBench (allenai/IFBench_test) - Allen AI's instruction-following benchmark

Installation

pip install if-verifiable

Usage

from if_verifiable import get_eval_data, evaluate_output_for_sample

# Load samples from a benchmark
for sample in get_eval_data("ifeval"):
    print(f"Prompt: {sample.prompt[:100]}...")
    print(f"Instructions: {sample.instruction_id_list}")
    break

# Evaluate a model's response
sample = next(get_eval_data("ifeval"))
response = "Your model's response here..."

results, scores = evaluate_output_for_sample("ifeval", sample, response)

# Access scores (4 metrics available)
print(f"Partial strict: {scores.partial_strict:.2%}")
print(f"Partial loose: {scores.partial_loose:.2%}")
print(f"Binary strict (all passed): {scores.binary_strict}")
print(f"Binary loose (all passed): {scores.binary_loose}")

# Check individual instruction results
for result in results:
    print(f"  {result.instruction_id}: strict={result.strict_pass}, loose={result.loose_pass}")
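The per-sample call above extends naturally to a whole benchmark run. The following is a minimal sketch, not part of the library: `average_scores` and its `generate` parameter are hypothetical names, and the loader and scorer are passed in as arguments so the helper only assumes the call shapes documented in this README.

```python
from statistics import mean
from typing import Any, Callable, Iterable, Optional

# The four metric names documented for EvaluationScores.
METRICS = ("partial_strict", "partial_loose", "binary_strict", "binary_loose")


def average_scores(
    benchmark: str,
    generate: Callable[[str], str],
    load_samples: Callable[[str], Iterable[Any]],
    score_sample: Callable[[str, Any, str], tuple],
    limit: Optional[int] = None,
) -> dict:
    """Run `generate` on each prompt and average the four metrics.

    `generate` is a hypothetical stand-in for your model call.
    `load_samples` and `score_sample` are expected to behave like
    get_eval_data and evaluate_output_for_sample respectively.
    """
    collected = []
    for index, sample in enumerate(load_samples(benchmark)):
        if limit is not None and index >= limit:
            break
        response = generate(sample.prompt)
        _results, scores = score_sample(benchmark, sample, response)
        collected.append(scores)
    if not collected:
        raise ValueError("no samples were evaluated")
    # Unweighted mean of each metric over the evaluated samples.
    return {name: mean(getattr(s, name) for s in collected) for name in METRICS}
```

With the library installed, a call would look like `average_scores("ifeval", my_model, get_eval_data, evaluate_output_for_sample, limit=50)`, where `my_model` is whatever function maps a prompt string to a response string.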

API

get_eval_data(benchmark: str) -> Iterator[BenchmarkSample]

Load evaluation samples from a benchmark dataset.

  • benchmark: Either "ifeval" or "ifbench"
  • Returns: An iterator of IFEvalSample dataclasses for "ifeval" or IFBenchSample dataclasses for "ifbench"

evaluate_output_for_sample(benchmark, sample, response) -> tuple[list[InstructionResult], EvaluationScores]

Evaluate a model response against a benchmark sample.

  • benchmark: Either "ifeval" or "ifbench"
  • sample: A sample from get_eval_data()
  • response: The model's text response

Returns:

  • list[InstructionResult]: Per-instruction pass/fail results
  • EvaluationScores: Aggregated scores dataclass with four metrics:
    • partial_strict: Fraction of instructions passed under strict evaluation
    • partial_loose: Fraction of instructions passed under loose evaluation, which tolerates formatting variations
    • binary_strict: 1.0 if every instruction passed strict evaluation, else 0.0
    • binary_loose: 1.0 if every instruction passed loose evaluation, else 0.0
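The relationship between the per-instruction results and the four metrics can be illustrated as follows. This is a sketch that follows the definitions listed above, not the library's own implementation; `summarize_results` is a hypothetical helper, and `InstructionResult` here is a local copy of the type documented under Types so the example runs without the package.

```python
from dataclasses import dataclass


@dataclass
class InstructionResult:
    # Local mirror of the documented type, for illustration only.
    instruction_id: str
    strict_pass: bool
    loose_pass: bool


def summarize_results(results):
    """Derive the four metrics from a list of per-instruction results."""
    total = len(results)
    if total == 0:
        raise ValueError("at least one instruction result is required")
    strict = sum(r.strict_pass for r in results)
    loose = sum(r.loose_pass for r in results)
    return {
        "partial_strict": strict / total,
        "partial_loose": loose / total,
        "binary_strict": 1.0 if strict == total else 0.0,
        "binary_loose": 1.0 if loose == total else 0.0,
    }


# Two instructions: one fails strict evaluation but passes loose evaluation.
demo = [
    InstructionResult("punctuation:no_comma", True, True),
    InstructionResult("length_constraints:number_words", False, True),
]
print(summarize_results(demo))
# {'partial_strict': 0.5, 'partial_loose': 1.0, 'binary_strict': 0.0, 'binary_loose': 1.0}
```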

Types

from dataclasses import dataclass
from typing import Any

@dataclass
class IFEvalSample:
    key: int
    prompt: str
    instruction_id_list: list[str]
    kwargs: list[dict[str, Any]]

@dataclass
class IFBenchSample:
    key: str
    prompt: str
    instruction_id_list: list[str]
    kwargs: list[dict[str, Any]]

@dataclass
class EvaluationScores:
    partial_strict: float  # Fraction of instructions passed (strict)
    partial_loose: float   # Fraction of instructions passed (loose)
    binary_strict: float   # 1.0 if all passed strict, else 0.0
    binary_loose: float    # 1.0 if all passed loose, else 0.0

@dataclass
class InstructionResult:
    instruction_id: str
    strict_pass: bool
    loose_pass: bool
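Because each InstructionResult carries its instruction_id, results collected over many samples can be grouped to see which instruction types a model struggles with. The sketch below is an illustration under that assumption; `pass_rates_by_instruction` is a hypothetical helper, and it relies only on the three documented fields.

```python
from collections import defaultdict


def pass_rates_by_instruction(all_results):
    """Tally strict and loose pass rates per instruction_id.

    `all_results` is an iterable of per-sample result lists, i.e. the
    first element returned for each evaluated sample.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [total, strict passes, loose passes]
    for results in all_results:
        for result in results:
            entry = counts[result.instruction_id]
            entry[0] += 1
            entry[1] += int(result.strict_pass)
            entry[2] += int(result.loose_pass)
    # Sort by instruction_id so the report order is stable.
    return {
        instruction_id: {
            "count": total,
            "strict": strict / total,
            "loose": loose / total,
        }
        for instruction_id, (total, strict, loose) in sorted(counts.items())
    }
```

Sorting the returned mapping by its "strict" value gives a quick view of the weakest instruction types.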

License

Apache 2.0

Download files

Source Distribution

if_verifiable-0.1.0.tar.gz (209.4 kB)

Built Distribution

if_verifiable-0.1.0-py3-none-any.whl (51.0 kB, Python 3)
