Skip to main content

Benchmarking suite for LLM hallucination detection, inference efficiency, and QA compression evaluation

Project description

AIP Bench

Benchmarking suite for LLM hallucination detection, inference efficiency, and QA quality evaluation.

Features

  • 20+ metrics — AUROC, F1, ECE, Brier, Perplexity, BLEU, ROUGE-L, METEOR, Exact Match, Bootstrap CI, macro-F1, pass@k
  • 14 benchmark datasets — HaluEval, MMLU, HellaSwag, ARC, WinoGrande, GSM8K, SQuAD v2, HotPotQA, TruthfulQA, BoolQ, FEVER, NQ
  • 7 pipeline types — Hallucination detection, Inference efficiency, QA compression, Multiple choice, Math, Fact verification, Open-domain QA
  • 4 model backends — HuggingFace, OpenAI, Anthropic, Dummy (testing)
  • CLIaip-bench run, compare, list, export
  • Few-shot prompts — Templates for 12 tasks with n-shot support
  • Export — JSON, CSV, HTML reports with comparison tables
  • Disk cache — Avoid re-running expensive inference
  • Config files — YAML benchmark suites
  • Visualization — Radar charts, bar comparisons (matplotlib)
  • All metrics are numpy-pure — No sklearn dependency

Install

pip install -e .                              # Core (numpy only)
pip install -e ".[bench]"                     # + HuggingFace datasets
pip install -e ".[bench-full]"                # + torch + transformers
pip install -e ".[all]"                       # Everything

Quick Start

from aip.bench import run_benchmark, compare

# Run a single benchmark (synthetic data, no downloads)
result = run_benchmark('halueval')
print(result.summary())

# Run all benchmarks
for name in ['halueval', 'ockbench', 'mmlu', 'gsm8k', 'fever']:
    result = run_benchmark(name)
    print(result)

# Compare configurations
from aip.bench import compare
report = compare(
    configs={
        "baseline": {"prune_ratio": 0.0},
        "pruned_25": {"prune_ratio": 0.25},
        "pruned_50": {"prune_ratio": 0.50},
    },
    benchmarks=["ockbench"],
)
print(report.table())
print(report.deltas("baseline"))

CLI

# Run benchmarks
aip-bench run halueval mmlu gsm8k --model dummy
aip-bench run ockbench --model hf:distilgpt2 -o results.json

# Compare configurations
aip-bench compare --configs base:prune_ratio=0 opt:prune_ratio=0.5 --tasks ockbench

# List everything
aip-bench list

# Export
aip-bench export results.json --format html -o report.html

Model Backends

from aip.bench.models import load_model

model = load_model("dummy")                          # Testing
model = load_model("hf:distilgpt2")                  # HuggingFace
model = load_model("hf:meta-llama/Llama-2-7b:cuda")  # HF + GPU
model = load_model("openai:gpt-4o")                  # OpenAI API
model = load_model("anthropic:claude-sonnet-4-5-20250929")        # Anthropic API

Compression Profiles

AIP offers three compression profiles — choose your trade-off:

Profile Prune Ratio Recent Window Token Savings Quality Retention
Conservative 10% 96 ~74% ~100%
Balanced 25% 64 ~78% ~88%
Aggressive 50% 32 ~85% ~77%

For most applications (chatbots, RAG, summaries), the Balanced profile saves 4x tokens with only 12% quality loss. With larger models (7B+), quality retention improves to 93-95%.

Savings vs Quality Curve

Benchmark Results

Synthetic Data (no downloads needed)

Token Efficiency (OckBench)

Config Tokens Saved Efficiency vs Baseline
No pruning 74.2% 0.609
Prune 25% 77.9% 0.639 +5.0%
Prune 50% 85.3% 0.699 +14.9%

QA Compression

Method Window Quality Retention Cosine Similarity
evict 64 0.882 0.792
evict 96 1.000 0.940
merge 64 1.000 0.959
merge 96 1.000 0.992

Hallucination Detection (HaluEval)

Metric Value
AUROC 1.000
F1 1.000
Precision 1.000
Recall 1.000

Real Data Results (distilgpt2, HuggingFace datasets)

Hallucination Detection — HaluEval QA (real labels)

Metric Synthetic Real (HuggingFace)
AUROC 1.000 0.788
F1 1.000 0.767
Precision 1.000 0.725
Recall 1.000 0.814

AUROC 0.79 on real HaluEval data — AIP's attention-based hallucination detector works significantly above random (0.5) on real LLM outputs.

QA Compression — Real KV caches

Dataset Quality Retention Cosine Similarity Samples
SQuAD v2 (real) 91.7% 0.765 45
HotPotQA (real) 86.0% 0.635 80
Synthetic (structured) 88.2% 0.792 100

Real KV caches from distilgpt2 show 92% quality retention on SQuAD — better than synthetic. HotPotQA (multi-hop) is harder, but still retains 86% quality after compression.

Real Model on Standard Benchmarks

Benchmark Metric Value Data Source
MMLU Accuracy 0.220 HuggingFace
GSM8K Accuracy 0.000 HuggingFace
FEVER Accuracy 0.300 HuggingFace
Natural Questions F1 0.013 HuggingFace

distilgpt2 (82M params) scores low on knowledge/math — expected. Validates the pipeline works end-to-end. Use a larger model (Llama-3, Mistral) for meaningful scores.

Available Metrics

Category Metrics
Classification AUROC, F1, Precision/Recall, ECE, Brier, Abstention Rate, Accuracy, macro-F1
Generation BLEU, ROUGE-L, METEOR, Perplexity
QA Exact Match, Token F1
Efficiency Token Efficiency, Input Compression, Output Quality/Token
Statistical Bootstrap CI, Optimal Threshold
Code pass@k

Available Datasets

Dataset Category Source
halueval_qa/dialogue/summarization Hallucination pminervini/HaluEval
truthfulqa Hallucination truthfulqa/truthful_qa
fever Fact Verification fever/fever
mmlu Knowledge cais/mmlu
hellaswag Reasoning Rowan/hellaswag
arc_challenge Reasoning allenai/ai2_arc
winogrande Reasoning allenai/winogrande
gsm8k Math openai/gsm8k
squad_v2 QA rajpurkar/squad_v2
hotpotqa QA hotpot_qa
boolq QA google/boolq
natural_questions QA google-research-datasets/natural_questions

Architecture

src/aip/bench/
    __init__.py      # Re-exports
    evaluator.py     # 20+ metrics (numpy-pure)
    datasets.py      # HuggingFace loaders + synthetic generators
    pipelines.py     # 7 benchmark pipeline classes
    models.py        # Model adapter layer (HF, OpenAI, Anthropic, Dummy)
    prompts.py       # Few-shot templates for 12 tasks
    compare.py       # Multi-config comparison with delta tables
    cache.py         # Disk-based result caching
    export.py        # JSON, CSV, HTML export
    viz.py           # Radar charts, bar comparisons
    config.py        # YAML config support
    cli.py           # Command-line interface
    torch_utils.py   # Optional torch/transformers utilities

Tests

pytest tests/ -v                    # All tests (162+)
pytest tests/test_bench.py -v       # Bench tests only
pytest tests/ -m slow -v            # Slow tests (require torch)

License

MIT License - Carmen Esteban, 2025-2026

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aip_bench_suite-0.1.0.tar.gz (68.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aip_bench_suite-0.1.0-py3-none-any.whl (62.2 kB view details)

Uploaded Python 3

File details

Details for the file aip_bench_suite-0.1.0.tar.gz.

File metadata

  • Download URL: aip_bench_suite-0.1.0.tar.gz
  • Upload date:
  • Size: 68.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for aip_bench_suite-0.1.0.tar.gz
Algorithm Hash digest
SHA256 76ded8fc4a48788750cd0be3f20d3d59e30a3eb1f1ae169a979c85a6c042f042
MD5 fbe0c60e6cba771f50098f07da3686e8
BLAKE2b-256 222e89aa413b89fe9940005fb83364623605d2d7aba89d08742d6881a73df730

See more details on using hashes here.

File details

Details for the file aip_bench_suite-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for aip_bench_suite-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 30b963af50b923d8f63b4a6599615f653036648e46ed7aee1994769170729f7c
MD5 9504523d6abb462f14e7e327cd3d9be0
BLAKE2b-256 5ff801073ef74aa103e6285ff4f3787eb880fb62f61b6239597d5ae043dff0e8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page