Lightweight NLP/LLM evaluation toolkit — metrics, judges, significance testing
juryeval
Lightweight NLP/LLM evaluation toolkit: metrics, LLM-as-Judge, statistical significance testing, prompt robustness analysis, and a CLI.
Designed for fast smoke tests and demos, and usable as a drop-in dependency for frameworks such as LM Eval Harness, DeepEval, Lighteval, and LangChain.
Install
pip install juryeval
Optional extras:
| Extra | What you get |
|---|---|
| [judge] | LLM-as-Judge (openai) |
| [semantic] | Embedding similarity (sentence-transformers) |
| [lmeval] | lm-eval-harness integration |
| [full] | All metrics (sklearn, sacrebleu, transformers, torch, etc.) |
| [all] | Everything |
CLI
# Score a single output
juryeval score --question "What is 2+2?" --output "4"
# Compare two outputs
juryeval compare --question "Capital of France?" --output-a "Paris" --output-b "London"
# Evaluate a dataset
juryeval evaluate --metric classification --predictions preds.json --references refs.json
# Judge calibration
juryeval calibrate --model gpt-4
# Prompt sensitivity analysis
juryeval prompt --question "Explain AI" --num-variants 5
Input files are JSON arrays, JSONL, or plain text (one sample per line). Run juryeval <command> --help for full options.
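As a rough illustration of the three input formats named above, the sketch below writes the same predictions as a JSON array, as JSONL, and as plain text using only the standard library. The file names and the choice of one bare JSON string per JSONL line are assumptions made for the example; juryeval evaluate --help is the place to confirm the exact schema the CLI expects.

```python
import json
import tempfile
from pathlib import Path

preds = ["pos", "neg", "pos"]

# Write into a temporary directory so the sketch leaves nothing in the repo.
out_dir = Path(tempfile.mkdtemp())

# 1) JSON array: the whole dataset is a single list.
(out_dir / "preds.json").write_text(json.dumps(preds), encoding="utf-8")

# 2) JSONL: one JSON value per line (assumed here to be a bare string).
(out_dir / "preds.jsonl").write_text(
    "\n".join(json.dumps(p) for p in preds) + "\n", encoding="utf-8"
)

# 3) Plain text: one sample per line.
(out_dir / "preds.txt").write_text("\n".join(preds) + "\n", encoding="utf-8")

# Round trip: all three files decode back to the same list.
from_json = json.loads((out_dir / "preds.json").read_text(encoding="utf-8"))
from_jsonl = [
    json.loads(line)
    for line in (out_dir / "preds.jsonl").read_text(encoding="utf-8").splitlines()
]
from_txt = (out_dir / "preds.txt").read_text(encoding="utf-8").splitlines()
```

Plain text is the simplest option but cannot hold samples that contain newlines; JSON or JSONL avoids that limitation.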
Python API
Metrics
from juryeval import (
    eval_classification, eval_translation, eval_summarization,
    perplexity, flesch_kincaid, bert_score,
)
acc_f1 = eval_classification(preds=["pos", "neg"], refs=["pos", "pos"])
bleu = eval_translation(preds=["hello world"], refs=["bonjour le monde"])
rouge = eval_summarization(preds=["summary"], refs=["reference"])
ppl = perplexity("This is a sentence.")
fk = flesch_kincaid("This is easy to read.")
bs = bert_score(preds=["answer"], refs=["reference"])
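For readers who want to see what an accuracy/F1 pair measures on the classification example above, here is a minimal pure-Python sketch. It is not juryeval's implementation: the helper names accuracy and macro_f1 are hypothetical, and macro averaging is an assumption, since the averaging mode used by eval_classification is not stated here.

```python
from collections import Counter


def accuracy(preds, refs):
    """Fraction of positions where the prediction equals the reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def macro_f1(preds, refs):
    """Unweighted mean of per-class F1 over every label seen in preds or refs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, r in zip(preds, refs):
        if p == r:
            tp[p] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[r] += 1  # true label was missed
    labels = set(preds) | set(refs)
    f1s = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1s.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


preds = ["pos", "neg"]
refs = ["pos", "pos"]
acc = accuracy(preds, refs)  # one of two predictions is correct
f1 = macro_f1(preds, refs)   # F1("pos") = 2/3, F1("neg") = 0
```

On this two-sample input the accuracy is 0.5 and the macro F1 is 1/3, which shows how a class that is predicted but never correct pulls the macro average down.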
LLM-as-Judge
from juryeval import PairwiseJudge, PointwiseJudge, MultiJudgeEnsemble, JudgeCalibration
# Pairwise comparison
judge = PairwiseJudge("gpt-4")
result = judge.compare(
    answer_a="Paris is the capital of France.",
    answer_b="It's Paris.",
    question="What is the capital of France?",
)
# {"winner": "A", "score": 1.0, "reason": "..."}
# Pointwise scoring
scorer = PointwiseJudge("gpt-4")
result = scorer.score("Paris is the capital.", question="What is the capital of France?")
# {"score": 0.9, "reason": "..."}
# Multi-judge ensemble
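ensemble = MultiJudgeEnsemble([
    PairwiseJudge("gpt-4"),
    PairwiseJudge("claude-3-opus"),
])
answer_a = "Paris is the capital of France."
answer_b = "It's Paris."
question = "What is the capital of France?"
result = ensemble.compare(answer_a, answer_b, question)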
# {"majority_winner": "A", "agreement": 0.67, ...}
# Judge calibration
cal = JudgeCalibration()
report = cal.evaluate(judge)
# {"position_bias": 0.05, "consistency": 0.95, "length_bias": 0.1, ...}
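The ensemble and calibration outputs above can be made concrete with a small offline sketch. It assumes that agreement means the share of judges voting for the majority winner and that position bias can be probed by swapping the order of the two answers; both are illustrative guesses, not juryeval's documented definitions. The functions majority_vote and position_flip_rate and the two stub judges are hypothetical and make no API calls.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common winner and the share of judges that chose it.

    Ties are broken by first occurrence (Counter.most_common behaviour).
    """
    counts = Counter(verdicts)
    winner, votes = counts.most_common(1)[0]
    return {"majority_winner": winner, "agreement": round(votes / len(verdicts), 2)}


def position_flip_rate(judge_fn, pairs):
    """Share of pairs where the same slot wins before and after swapping.

    judge_fn(question, first, second) returns "A" if the answer shown first
    wins and "B" otherwise. A position-independent judge picks the same
    underlying answer in both orders, so its slot label changes after a swap.
    """
    flips = 0
    for question, ans_1, ans_2 in pairs:
        original = judge_fn(question, ans_1, ans_2)
        swapped = judge_fn(question, ans_2, ans_1)
        if original == swapped:  # same slot won twice: verdict followed position
            flips += 1
    return flips / len(pairs)


# Hypothetical verdicts from three judges on one comparison.
vote = majority_vote(["A", "A", "B"])


def longer_wins(question, first, second):
    """Stub judge: prefers the longer answer regardless of position."""
    return "A" if len(first) >= len(second) else "B"


def first_wins(question, first, second):
    """Stub judge: always prefers whatever is shown first."""
    return "A"


pairs = [("Capital of France?", "Paris is the capital of France.", "It's Paris.")]
unbiased = position_flip_rate(longer_wins, pairs)
biased = position_flip_rate(first_wins, pairs)
```

With three judges voting A, A, B the sketch reports an agreement of 0.67; the length-based stub scores 0.0 on the swap test and the always-first stub scores 1.0.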
Statistical Significance
from juryeval import bootstrap_ci, compare_models
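# scores: a sequence of numeric scores for one model
ci = bootstrap_ci(scores, num_resamples=2000)
# {"estimate": 0.72, "lower": 0.68, "upper": 0.76, "std_err": 0.02}
# model_a_scores, model_b_scores: score sequences for the two models being compared
result = compare_models(model_a_scores, model_b_scores)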
# {"win_rate": 0.65, "p_value": 0.003, "mean_a": 0.72, "mean_b": 0.68, ...}
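To show the idea behind a bootstrap confidence interval and a paired model comparison, here is a standard-library sketch. It uses the percentile bootstrap and defines the win rate as the fraction of resamples in which model A's mean exceeds model B's; both are assumptions for illustration, and juryeval's bootstrap_ci and compare_models may use different estimators. The function names and the sample scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_percentile_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(num_resamples))
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return {"estimate": mean(scores), "lower": lower, "upper": upper}


def paired_bootstrap(scores_a, scores_b, num_resamples=2000, seed=0):
    """Resample paired differences and count how often A's mean beats B's."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    wins = sum(mean(rng.choices(diffs, k=n)) > 0 for _ in range(num_resamples))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "win_rate": wins / num_resamples,
    }


# Made-up per-sample scores for two models on the same eight examples.
model_a_scores = [0.60, 0.70, 0.80, 0.75, 0.65, 0.90, 0.70, 0.72]
model_b_scores = [0.55, 0.72, 0.70, 0.70, 0.60, 0.85, 0.71, 0.60]

ci = bootstrap_percentile_ci(model_a_scores)
cmp_result = paired_bootstrap(model_a_scores, model_b_scores)
```

Resampling the paired differences, rather than each model's scores independently, keeps per-example difficulty aligned between the two models, which usually tightens the comparison.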
Prompt Robustness
from juryeval import PromptVariance
pv = PromptVariance(model_fn=lambda prompt: "output")
report = pv.analyze("What is 2+2?")
# {"num_variants": 7, "output_length_mean": 5.0, "outputs": [...], ...}
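A prompt-sensitivity check of this kind can be sketched as: rewrite the question a few ways, call the model on each variant, and summarize how much the outputs differ. The variant templates, the analyze helper, and the extra report keys below are hypothetical and are not the set that PromptVariance generates.

```python
from statistics import mean, pstdev


def make_variants(question):
    """Hypothetical surface rewrites of one question."""
    return [
        question,
        question.lower(),
        f"Please answer: {question}",
        f"{question} Answer briefly.",
        f"Q: {question}\nA:",
    ]


def analyze(model_fn, question):
    """Run model_fn on every variant and summarize the spread of outputs."""
    variants = make_variants(question)
    outputs = [model_fn(v) for v in variants]
    lengths = [len(o) for o in outputs]
    return {
        "num_variants": len(variants),
        "outputs": outputs,
        "output_length_mean": mean(lengths),
        "output_length_std": pstdev(lengths),
        "num_distinct_outputs": len(set(outputs)),
    }


# A constant stub model: every variant yields the same answer.
report = analyze(lambda prompt: "4", "What is 2+2?")
```

A model that is robust to rephrasing should show few distinct outputs and a small length spread; the constant stub here gives one distinct output and zero spread.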
Framework Integrations
| Framework | Setup |
|---|---|
| lm-eval-harness | pip install juryeval[lmeval] then from juryeval.lmeval import register_all; register_all() |
| DeepEval | pip install deepeval[juryeval] then from deepeval.metrics.juryeval import JuryEvalMetric |
| Lighteval | pip install lighteval[juryeval] then use JuryEvalPointwiseJudge / JuryEvalPairwiseJudge metrics |
| LangChain | pip install langchain[juryeval] |
See each framework's documentation for detailed usage.
Development
pip install pytest
pytest tests/ -v
License
MIT