Lightweight NLP/LLM evaluation toolkit — metrics, judges, significance testing

juryeval

Lightweight NLP/LLM evaluation toolkit: metrics, LLM-as-Judge, statistical significance testing, prompt robustness analysis, and a CLI.

Designed for fast smoke tests and demos, and as a drop-in dependency for frameworks such as LM Eval Harness, DeepEval, Lighteval, and LangChain.

Install

pip install juryeval

Optional extras:

Extra       What you get
[judge]     LLM-as-Judge (openai)
[semantic]  Embedding similarity (sentence-transformers)
[lmeval]    lm-eval-harness integration
[full]      All metrics (sklearn, sacrebleu, transformers, torch, etc.)
[all]       Everything
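
To see which optional extras are already usable in an environment, a small check along these lines may help. It is only a sketch: `available_extras` and `EXTRA_MODULES` are hypothetical names, not part of juryeval, and the mapping from extras to importable modules is an assumption based on the packages named in the table above.

```python
from importlib.util import find_spec

# Assumed mapping from juryeval extras to the top-level module each one brings in.
# These module names come from the upstream packages, not from juryeval itself.
EXTRA_MODULES = {
    "judge": "openai",
    "semantic": "sentence_transformers",
    "lmeval": "lm_eval",
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra_name: bool}, True when the extra's module can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, installed in available_extras().items():
        hint = "installed" if installed else f"missing, try: pip install 'juryeval[{extra}]'"
        print(f"{extra}: {hint}")
```

Quoting the requirement (`'juryeval[judge]'`) avoids bracket globbing in shells such as zsh.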

CLI

# Score a single output
juryeval score --question "What is 2+2?" --output "4"

# Compare two outputs
juryeval compare --question "Capital of France?" --output-a "Paris" --output-b "London"

# Evaluate a dataset
juryeval evaluate --metric classification --predictions preds.json --references refs.json

# Judge calibration
juryeval calibrate --model gpt-4

# Prompt sensitivity analysis
juryeval prompt --question "Explain AI" --num-variants 5

Input files are JSON arrays, JSONL, or plain text (one sample per line). Run juryeval <command> --help for full options.
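
The three input formats can be produced or inspected with a few lines of Python. The loader below is a rough sketch of how such files might be read, not juryeval's own code: `load_samples` is a hypothetical helper, and choosing the format by file extension is an assumption.

```python
import json
from pathlib import Path


def load_samples(path):
    """Load samples from a JSON array, a JSONL file, or a plain-text file.

    Format detection by extension is an assumption made for this sketch.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()

    if suffix == ".json":
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError(f"{path} must contain a JSON array")
        return data

    # Both remaining formats are line-oriented; blank lines are skipped.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if suffix == ".jsonl":
        return [json.loads(line) for line in lines]
    return lines  # plain text: one sample per line
```

For example, a `preds.json` file containing `["pos", "neg"]` and a `preds.txt` file with `pos` and `neg` on separate lines would both load as `["pos", "neg"]`.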

Python API

Metrics

from juryeval import (
    eval_classification, eval_translation, eval_summarization,
    perplexity, flesch_kincaid, bert_score,
)

acc_f1 = eval_classification(preds=["pos", "neg"], refs=["pos", "pos"])
# Predictions and references must be in the same (target) language
bleu   = eval_translation(preds=["bonjour le monde"], refs=["bonjour tout le monde"])
rouge  = eval_summarization(preds=["summary"], refs=["reference"])
ppl    = perplexity("This is a sentence.")
fk     = flesch_kincaid("This is easy to read.")
bs     = bert_score(preds=["answer"], refs=["reference"])
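
For intuition about what a classification metric of this kind reports, here is a dependency-free sketch of accuracy and macro-averaged F1. It is not juryeval's implementation: the function name `accuracy_and_macro_f1` and the choice of macro averaging are assumptions, and the real `eval_classification` may return different keys.

```python
def accuracy_and_macro_f1(preds, refs):
    """Compute accuracy and macro-averaged F1 for equal-length label lists."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        raise ValueError("at least one sample is required")

    pairs = list(zip(preds, refs))
    accuracy = sum(p == r for p, r in pairs) / len(pairs)

    f1_scores = []
    # Every label seen in either list contributes one per-class F1 score.
    for label in sorted(set(preds) | set(refs)):
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because the label occurs at least once in preds or refs.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


scores = accuracy_and_macro_f1(preds=["pos", "neg"], refs=["pos", "pos"])
print(scores)
```

With the two-sample input above, one of two predictions is correct, so accuracy is 0.5; the per-class F1 scores are 2/3 for "pos" and 0 for "neg", giving a macro F1 of 1/3.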

LLM-as-Judge

from juryeval import PairwiseJudge, PointwiseJudge, MultiJudgeEnsemble, JudgeCalibration

# Pairwise comparison
judge = PairwiseJudge("gpt-4")
result = judge.compare(
    answer_a="Paris is the capital of France.",
    answer_b="It's Paris.",
    question="What is the capital of France?",
)
# {"winner": "A", "score": 1.0, "reason": "..."}

# Pointwise scoring
scorer = PointwiseJudge("gpt-4")
result = scorer.score("Paris is the capital.", question="What is the capital of France?")
# {"score": 0.9, "reason": "..."}

# Multi-judge ensemble
ensemble = MultiJudgeEnsemble([
    PairwiseJudge("gpt-4"),
    PairwiseJudge("claude-3-opus"),
])
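answer_a = "Paris is the capital of France."
answer_b = "It's Paris."
question = "What is the capital of France?"
result = ensemble.compare(answer_a, answer_b, question)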
# {"majority_winner": "A", "agreement": 0.67, ...}

# Judge calibration
cal = JudgeCalibration()
report = cal.evaluate(judge)
# {"position_bias": 0.05, "consistency": 0.95, "length_bias": 0.1, ...}
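
The ensemble and calibration results above can be pictured with a small, self-contained sketch. Nothing here is juryeval's implementation: `majority_vote` and `position_flip_rate` are hypothetical helpers, judges are modelled as plain callables returning "A" or "B", and the definitions of agreement and position bias are assumptions.

```python
from collections import Counter


def majority_vote(verdicts):
    """Aggregate per-judge verdicts ("A" or "B") into a majority decision.

    Agreement is taken to be the share of judges that voted for the winner.
    On a tie, the verdict that was seen first wins.
    """
    counts = Counter(verdicts)
    winner, votes = counts.most_common(1)[0]
    return {"majority_winner": winner, "agreement": votes / len(verdicts)}


def position_flip_rate(judge_fn, pairs):
    """Share of pairs whose verdict is inconsistent when answer order is swapped.

    judge_fn(question, first, second) must return "A" (first) or "B" (second).
    A position-independent judge returns "A" then "B", or "B" then "A".
    """
    inconsistent = 0
    for question, first, second in pairs:
        original = judge_fn(question, first, second)
        swapped = judge_fn(question, second, first)
        if (original, swapped) not in {("A", "B"), ("B", "A")}:
            inconsistent += 1
    return inconsistent / len(pairs)


# Stub judges for illustration only; a real judge would call an LLM.
def always_first(question, first, second):
    return "A"


def prefers_longer(question, first, second):
    return "A" if len(first) > len(second) else "B"


pairs = [("Capital of France?", "Paris is the capital of France.", "It's Paris.")]
print(majority_vote(["A", "A", "B"]))
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(prefers_longer, pairs))
```

Under these definitions a judge that always picks the first answer has a flip rate of 1.0, while a judge whose choice depends only on the answers themselves has a flip rate of 0.0.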

Statistical Significance

from juryeval import bootstrap_ci, compare_models

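# scores, model_a_scores, and model_b_scores are not defined in this snippet;
# they are assumed to be sequences of per-sample scores.
ci = bootstrap_ci(scores, num_resamples=2000)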
# {"estimate": 0.72, "lower": 0.68, "upper": 0.76, "std_err": 0.02}

result = compare_models(model_a_scores, model_b_scores)
# {"win_rate": 0.65, "p_value": 0.003, "mean_a": 0.72, "mean_b": 0.68, ...}
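
As a rough illustration of what a bootstrap confidence interval involves, the sketch below implements a percentile bootstrap of the mean using only the standard library. It is an assumption that juryeval uses a comparable method; `percentile_bootstrap_ci`, its `alpha` and `seed` parameters, and the sample scores are all hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(scores)

    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(num_resamples)
    )

    # Read the interval bounds off the sorted bootstrap means.
    lower_idx = int((alpha / 2) * num_resamples)
    upper_idx = min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))

    return {
        "estimate": statistics.fmean(scores),
        "lower": means[lower_idx],
        "upper": means[upper_idx],
        "std_err": statistics.stdev(means),  # spread of the bootstrap means
    }


sample_scores = [0.6, 0.7, 0.8, 0.75, 0.65, 0.9, 0.55, 0.7]  # made-up values
print(percentile_bootstrap_ci(sample_scores))
```

The percentile method is the simplest variant; with very small samples the interval can be too narrow, so the bounds are best read as approximate.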

Prompt Robustness

from juryeval import PromptVariance

pv = PromptVariance(model_fn=lambda prompt: "output")
report = pv.analyze("What is 2+2?")
# {"num_variants": 7, "output_length_mean": 5.0, "outputs": [...], ...}
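
Conceptually, prompt sensitivity analysis perturbs a prompt, runs the model on each variant, and summarizes how the outputs differ. The sketch below shows that loop under stated assumptions: `make_variants` and `analyze_prompt` are hypothetical, the perturbations are illustrative rather than the ones PromptVariance applies, and output length is measured in characters.

```python
import statistics


def make_variants(prompt):
    """Build a few surface-level rewrites of a prompt (illustrative only)."""
    candidates = [
        prompt,
        prompt.lower(),
        prompt.upper(),
        prompt.rstrip("?.! "),          # drop trailing punctuation
        f"Please answer: {prompt}",     # polite prefix
        f"{prompt} Be concise.",        # instruction suffix
    ]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))


def analyze_prompt(model_fn, prompt):
    """Run model_fn on every variant and summarize the outputs."""
    variants = make_variants(prompt)
    outputs = [model_fn(variant) for variant in variants]
    return {
        "num_variants": len(variants),
        "output_length_mean": statistics.fmean(len(out) for out in outputs),
        "num_distinct_outputs": len(set(outputs)),
        "outputs": outputs,
    }


report = analyze_prompt(lambda prompt: "4", "What is 2+2?")
print(report["num_variants"], report["num_distinct_outputs"])
```

A model that is robust to these rewrites yields a single distinct output; a larger count suggests the answer depends on phrasing.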

Framework Integrations

Framework        Setup
lm-eval-harness  pip install juryeval[lmeval], then from juryeval.lmeval import register_all; register_all()
DeepEval         pip install deepeval[juryeval], then from deepeval.metrics.juryeval import JuryEvalMetric
Lighteval        pip install lighteval[juryeval], then use the JuryEvalPointwiseJudge / JuryEvalPairwiseJudge metrics
LangChain        pip install langchain[juryeval]

See each framework's documentation for detailed usage.

Development

pip install pytest
pytest tests/ -v

License

MIT
