Lightweight NLP/LLM evaluation toolkit — metrics, judges, significance testing
juryeval
Lightweight NLP/LLM evaluation toolkit — metrics, LLM-as-Judge infrastructure, statistical significance testing, and prompt robustness analysis.
Designed for fast smoke tests and demos, and as a shared dependency for evaluation frameworks such as LM Eval Harness, OpenCompass, and Lighteval.
Install
pip install juryeval
# Optional feature sets:
# (quotes keep shells such as zsh from expanding the square brackets)
pip install "juryeval[full]" # all metrics (sklearn, sacrebleu, transformers, etc.)
pip install "juryeval[judge]" # LLM-as-Judge (openai)
pip install "juryeval[semantic]" # embedding similarity (sentence-transformers)
pip install "juryeval[lmeval]" # lm-eval-harness integration
pip install "juryeval[all]" # everything
Usage
Metrics
from juryeval import (
    eval_classification, eval_translation, eval_summarization,
    perplexity, flesch_kincaid, bert_score,
)
acc_f1 = eval_classification(preds=["pos", "neg"], refs=["pos", "pos"])
# preds and refs must be in the same (target) language for BLEU to be meaningful
bleu = eval_translation(preds=["bonjour le monde"], refs=["bonjour tout le monde"])
rouge = eval_summarization(preds=["summary here"], refs=["reference here"])
ppl = perplexity("This is a sentence.")
fk = flesch_kincaid("This is easy to read.")
bs = bert_score(preds=["answer"], refs=["reference"])
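As a rough illustration of what a classification metric reports, here is a minimal standard-library sketch of accuracy and per-class F1 on the same toy inputs. The helper names `accuracy` and `per_class_f1` are hypothetical and are not part of juryeval; the averaging that `eval_classification` applies internally is not documented here, so macro-averaging is only an assumption.

```python
from collections import Counter


def accuracy(preds, refs):
    """Fraction of predictions that exactly match their reference label."""
    if not refs or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and equally long")
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def per_class_f1(preds, refs):
    """F1 per label, from true positives, false positives, false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, r in zip(preds, refs):
        if p == r:
            tp[p] += 1
        else:
            fp[p] += 1  # predicted p, but the true label was different
            fn[r] += 1  # the true label r was missed
    scores = {}
    for label in sorted(set(preds) | set(refs)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


preds, refs = ["pos", "neg"], ["pos", "pos"]
acc = accuracy(preds, refs)
f1 = per_class_f1(preds, refs)
macro_f1 = sum(f1.values()) / len(f1)  # assumption: unweighted mean over labels
print(acc, f1, macro_f1)
```

With one of two predictions correct, accuracy is 0.5; the "neg" class never appears in the references, which pulls the macro average down.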
LLM-as-Judge
from juryeval import PairwiseJudge, PointwiseJudge, MultiJudgeEnsemble, JudgeCalibration
judge = PairwiseJudge("gpt-4")
result = judge.compare(
    answer_a="Paris is the capital of France.",
    answer_b="It's Paris.",
    question="What is the capital of France?",
)
# {"winner": "A", "score": 1.0, "reason": "..."}
# Pointwise scoring
scorer = PointwiseJudge("gpt-4")
result = scorer.score("Paris is the capital.", question="What is the capital of France?")
# {"score": 0.9, "reason": "..."}
# Multi-judge ensemble
ensemble = MultiJudgeEnsemble([
    PairwiseJudge("gpt-4"),
    PairwiseJudge("claude-3-opus"),
    PairwiseJudge("gemini-pro"),
])
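answer_a, answer_b = "Paris is the capital of France.", "It's Paris."
question = "What is the capital of France?"
result = ensemble.compare(answer_a, answer_b, question)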
# {"majority_winner": "A", "agreement": 0.67, "vote_distribution": {...}, ...}
# Judge calibration
cal = JudgeCalibration()
report = cal.evaluate(judge)
# {"position_bias": 0.05, "consistency": 0.95, "length_bias": 0.1, ...}
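To make the ensemble and calibration fields above more concrete, the following sketch shows one plausible way a majority vote with an agreement ratio, and a position-swap consistency check, could be computed. It is not juryeval's implementation: `majority_vote`, `position_flip_rate`, and the stub judges are hypothetical names, and a judge is modelled as a plain callable so the example runs without any API key.

```python
from collections import Counter


def majority_vote(winners):
    """Aggregate per-judge verdicts ("A" or "B") into a single result."""
    counts = Counter(winners)
    top, top_count = counts.most_common(1)[0]
    return {
        "majority_winner": top,
        "agreement": round(top_count / len(winners), 2),
        "vote_distribution": dict(counts),
    }


def position_flip_rate(judge_fn, pairs):
    """Share of pairs whose verdict changes when the answer order is swapped.

    judge_fn(first, second) is a hypothetical callable returning "A" when it
    prefers the first answer and "B" when it prefers the second.
    """
    swap = {"A": "B", "B": "A"}
    flips = 0
    for a, b in pairs:
        original = judge_fn(a, b)
        swapped = judge_fn(b, a)
        # A consistent judge picks the same underlying answer, which sits in
        # the other slot after the swap, so its label should be mirrored.
        if swap.get(swapped, swapped) != original:
            flips += 1
    return flips / len(pairs)


def first_slot_judge(first, second):
    """Stub: always prefers whatever is shown first (maximal position bias)."""
    return "A"


def longer_judge(first, second):
    """Stub: prefers the longer answer regardless of position."""
    return "A" if len(first) >= len(second) else "B"


pairs = [("Paris is the capital of France.", "It's Paris.")]
vote = majority_vote(["A", "A", "B"])
biased_rate = position_flip_rate(first_slot_judge, pairs)
stable_rate = position_flip_rate(longer_judge, pairs)
print(vote, biased_rate, stable_rate)
```

Two of three judges agreeing gives an agreement of 0.67. The first-slot stub flips on every swapped pair, while the length-based stub never does; note that the latter is position-consistent but would still show length bias, which is why the two are reported separately.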
Statistical Significance
from juryeval import bootstrap_ci, compare_models
# scores: per-example metric values for one model (assumed here to be a sequence of floats)
ci = bootstrap_ci(scores, num_resamples=2000)
# {"estimate": 0.72, "lower": 0.68, "upper": 0.76, "std_err": 0.02}
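# model_a_scores / model_b_scores: per-example scores for the two models on the same items (assumed)
result = compare_models(model_a_scores, model_b_scores)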
# {"win_rate": 0.65, "p_value": 0.003, "mean_a": 0.72, "mean_b": 0.68, ...}
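For readers who want to see what these statistics mean, here is a minimal standard-library sketch of a percentile bootstrap confidence interval and a paired sign-flip permutation test. The function names, the `alpha` and `seed` parameters, and the choice of permutation test are assumptions made for illustration; juryeval's `bootstrap_ci` and `compare_models` may use different methods.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(num_resamples)
    )
    lo_idx = int((alpha / 2) * num_resamples)
    hi_idx = min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))
    return {
        "estimate": statistics.fmean(scores),
        "lower": means[lo_idx],
        "upper": means[hi_idx],
        "std_err": statistics.stdev(means),
    }


def paired_permutation_test(a, b, num_permutations=5000, seed=0):
    """Two-sided sign-flip test on paired per-example score differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(num_permutations):
        # Under the null hypothesis, each difference is equally likely
        # to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    return {
        "win_rate": sum(d > 0 for d in diffs) / len(diffs),
        "p_value": (hits + 1) / (num_permutations + 1),
        "mean_a": statistics.fmean(a),
        "mean_b": statistics.fmean(b),
    }


# Made-up scores, for illustration only.
model_b = [0.60, 0.70, 0.80, 0.75, 0.65, 0.90, 0.72, 0.68, 0.55, 0.81]
model_a = [s + 0.1 for s in model_b]
ci = percentile_bootstrap_ci(model_b)
cmp_result = paired_permutation_test(model_a, model_b)
print(ci, cmp_result)
```

Fixing the seed makes the resampling reproducible, which is useful when the interval is reported alongside a benchmark number.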
Prompt Robustness
from juryeval import PromptVariance
# model_fn maps a prompt string to the model's output string; a constant stub is used here
pv = PromptVariance(model_fn=lambda prompt: "output")
report = pv.analyze("What is 2+2?")
# {"num_variants": 7, "output_length_mean": 5.0, "outputs": [...], ...}
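The general idea behind a prompt robustness check can be sketched as follows: generate light rewrites of a prompt, run the model on each, and summarise how much the outputs vary. The perturbations, the `make_variants` and `analyze_variants` helpers, and the extra report fields below are all invented for illustration; the variants that `PromptVariance` actually generates are not documented in this README.

```python
import statistics


def make_variants(prompt):
    """Hypothetical surface-level rewrites of a prompt."""
    return [
        prompt,
        prompt.lower(),
        prompt.upper(),
        prompt.rstrip("?.!"),
        f"  {prompt}  ",
        f"Question: {prompt}",
        f"{prompt} Answer briefly.",
    ]


def analyze_variants(model_fn, prompt):
    """Run model_fn on every variant and summarise output variation."""
    variants = make_variants(prompt)
    outputs = [model_fn(v) for v in variants]
    lengths = [len(o.split()) for o in outputs]  # length in whitespace tokens
    return {
        "num_variants": len(variants),
        "output_length_mean": statistics.fmean(lengths),
        "output_length_stdev": statistics.pstdev(lengths),
        "unique_outputs": len(set(outputs)),
        "outputs": outputs,
    }


# A constant stub model: every variant yields the same output.
stub_report = analyze_variants(lambda prompt: "output", "What is 2+2?")
print(stub_report)
```

A model that is robust to rephrasing should show few unique outputs and a small length spread; the constant stub is the degenerate case with exactly one unique output.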
LM Eval Harness Integration
pip install "juryeval[lmeval]"
python -c "from juryeval.lmeval import register_all; register_all()"
# Then reference the pairwise_judge / pointwise_judge metrics in your task YAML:
# metric_list:
#   - metric: pairwise_judge
#     aggregation: mean
#     higher_is_better: true
Running Tests
pip install pytest
pytest tests/ -v
License
MIT
Download files
File details
Details for the file juryeval-0.4.1.tar.gz.
File metadata
- Download URL: juryeval-0.4.1.tar.gz
- Upload date:
- Size: 24.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6856521a93f5a7895d02080d0ba0018fedfa29f31754ee493714fba1a3dbe729 |
| MD5 | d698c89fb0e364d85d46f8c220847582 |
| BLAKE2b-256 | 61545973058e017f567219ce316420adbac787bc3f15cc84ff97290cdae59cfe |
File details
Details for the file juryeval-0.4.1-py3-none-any.whl.
File metadata
- Download URL: juryeval-0.4.1-py3-none-any.whl
- Upload date:
- Size: 32.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 480bc4843cb0c08bb23d342dcbaf4ef0891966bd3fe8bb723655fdfd8d4b3c57 |
| MD5 | e686c75f8a49f1e4cffcab37b0ad25df |
| BLAKE2b-256 | 83f12fe4a8f23b26b4be712f78af97ef3c3875730624050153d3f0d2b31ae8c8 |