Python SDK for evaluating multiple model outputs using configurable LLM-based jurors

OpenJury 🏛️

A Python SDK for evaluating your agent's response quality using a configurable panel of LLM judges.

Python 3.11+ · License: Apache 2.0


Overview

OpenJury is an agent evaluation framework. Point it at your agent's HTTP endpoint and it will:

  1. Send a prompt to your agent and collect the response
  2. Pass the response to a panel of LLM judges (jurors), each scoring it against your criteria
  3. Return a composite quality score with a full statistical breakdown

The primary output is a single composite_score: a weighted mean of all juror scores across all criteria. It is one of eight canned metrics; the others include median, harmonic mean, weakest link, and juror agreement. You can also register a custom scoring function for domain-specific logic.
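To make the weighted mean concrete, here is a minimal sketch of one plausible way to combine juror weights and criterion weights. The function name weighted_composite and the choice to multiply the two weights are assumptions made for illustration; OpenJury's actual aggregation may differ in detail.

```python
def weighted_composite(scores, juror_weights, criterion_weights):
    """Weighted mean over scores[juror][criterion].

    Assumption: the weight of a single score is
    juror weight * criterion weight.
    """
    total = 0.0
    weight_sum = 0.0
    for juror, per_criterion in scores.items():
        for criterion, score in per_criterion.items():
            w = juror_weights[juror] * criterion_weights[criterion]
            total += w * score
            weight_sum += w
    if weight_sum == 0:
        raise ValueError("weights must not all be zero")
    return total / weight_sum


# Illustrative scores on a 1-5 scale (made-up numbers).
scores = {
    "Support Expert": {"helpfulness": 4, "accuracy": 5},
    "Customer Perspective": {"helpfulness": 3, "accuracy": 4},
}
juror_weights = {"Support Expert": 2.0, "Customer Perspective": 1.0}
criterion_weights = {"helpfulness": 2.0, "accuracy": 2.0}

composite = weighted_composite(scores, juror_weights, criterion_weights)
print(f"{composite:.2f}")
```

Because the juror with weight 2.0 scored higher on both criteria, the composite lands closer to that juror's scores than a plain average would.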

Why a jury instead of a single judge?

Relying on one LLM to evaluate outputs is common but fragile: it's expensive and prone to intra-model bias. Research from Cohere shows that a panel of smaller, diverse models produces more reliable and less biased evaluations at lower cost.

OpenJury makes this practical: configure jurors declaratively in JSON, wire rubrics per criterion for consistent scoring, and get a structured result you can act on.


Installation

Requirements: Python 3.11 or newer

pip install openjury

From source

git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]"     # optional dev dependencies

Choose your path

| Track | Goal | Time | API keys? |
| --- | --- | --- | --- |
| Try it | See output shape, understand AgentEvalResult | 2 min | No |
| Evaluate my agent | Full fetch + score pipeline | 10 min | Yes |
| Production integrate | Batch, resume, custom scoring, CI | 30+ min | Yes |

Try it (no agent, no keys)

pip install openjury
python examples/hello_world/score_existing.py

examples/hello_world/ · offline demo with sample output

Evaluate my agent

# Terminal 1 — mock agent (or use your own endpoint)
python examples/tools/mock_agent.py --port 8080

# Terminal 2
export OPENAI_API_KEY="..." AGENT_API_KEY=demo
python examples/basic_usage/basic_jury_run.py

examples/basic_usage/ · docs/endpoint-config.md

Go deeper

  • docs/ — architecture, config schema, composable API, CLI
  • recipes/ — task-oriented how-tos
  • notebooks/ — interactive Jupyter walkthroughs
  • examples/ — full examples index

Quick Start

1. Create a jury config

Set a jury-level llm_provider for shared credentials. Jurors inherit it by default. Use ${ENV_VAR} for secrets.

{
  "name": "Customer Support Jury",
  "score_scale": 5,
  "llm_provider": {
    "provider": "openai_compatible",
    "model_name": "gpt-4o-mini",
    "api_key": "${OPENAI_API_KEY}"
  },
  "jurors": [
    { "name": "Support Expert", "system_prompt": "You are a senior support manager.", "weight": 2.0 },
    { "name": "Customer Perspective", "weight": 1.0 }
  ],
  "criteria": [
    {
      "name": "helpfulness",
      "description": "Does the response resolve the customer's issue?",
      "weight": 2.0,
      "rubric": {
        "1": "Ignores or misunderstands the question",
        "3": "Partially addresses the question",
        "5": "Directly and completely resolves the issue"
      }
    },
    {
      "name": "accuracy",
      "description": "Is the information factually correct?",
      "weight": 2.0,
      "rubric": {
        "1": "Contains factual errors",
        "3": "Mostly accurate with minor gaps",
        "5": "Completely accurate"
      }
    }
  ]
}

Full field reference: docs/config-schema.md
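The ${ENV_VAR} placeholders in the config are resolved from the environment. As a rough illustration of the idea (not OpenJury's own loader), the sketch below walks a parsed config and substitutes each placeholder. The helper name expand_env and its error behaviour are assumptions.

```python
import json
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${VAR} in every string with its value from env."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        def _substitute(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_PATTERN.sub(_substitute, value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged


raw = json.loads(
    '{"llm_provider": {"api_key": "${OPENAI_API_KEY}"}, "score_scale": 5}'
)
# A fake environment is passed in here so the example needs no real key.
resolved = expand_env(raw, env={"OPENAI_API_KEY": "sk-demo"})
print(resolved["llm_provider"]["api_key"])
```

Failing loudly on an unset variable, rather than leaving the placeholder in place, keeps a literal "${OPENAI_API_KEY}" string from being sent to a provider as a credential.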

2. Run an evaluation

from openjury import JuryConfig, OpenJury, ResultFormatter
from openjury.endpoint_fetcher import AgentEndpoint

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

endpoint = AgentEndpoint(
    url="http://localhost:8080/v1/chat/completions",
    alias="my-agent",
    headers={"Authorization": "Bearer ${AGENT_API_KEY}"},
    request_body_template={
        "model": "my-model",
        "messages": [{"role": "user", "content": "{prompt}"}],
    },
)

result = jury.evaluate(prompt="How do I reset my password?", endpoint=endpoint)
print(ResultFormatter.format_result(result))
print(f"Score: {result.composite_score:.2f} / {result.score_scale}")

score_response() is a backward-compatible alias for evaluate().
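The request_body_template above carries a "{prompt}" placeholder inside a nested structure. A minimal sketch of how such a template could be filled before the request is sent is shown here; fill_template is a hypothetical helper, and the library's own substitution logic may work differently.

```python
def fill_template(template, prompt):
    """Return a copy of template with every "{prompt}" replaced by prompt.

    str.replace is used instead of str.format so that braces inside the
    prompt itself (for example JSON snippets) are left untouched.
    """
    if isinstance(template, str):
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {key: fill_template(item, prompt) for key, item in template.items()}
    if isinstance(template, list):
        return [fill_template(item, prompt) for item in template]
    return template


template = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "{prompt}"}],
}
body = fill_template(template, "How do I reset my password?")
print(body["messages"][0]["content"])
```

The original template is not mutated, so the same endpoint definition can be reused across many prompts.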

CLI:

openjury run \
  --config jury_config.json \
  --endpoints-config endpoints.json \
  --prompt "How do I reset my password?"

3. Read the output

╔══ Quality Evaluation  (scale: 1–5) ══
  composite_score:   3.87 / 5  (0.774 normalized)
  juror_agreement (0–1)        0.880   ← 1 = unanimous
  ...

The key fields:
  • composite_score — headline quality number (weighted_mean from trial 1)
  • juror_agreement — near 1.0 = high confidence; near 0 = contested
  • weakest_link — flags a standout failure even when composite looks okay
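One way to act on these numbers in a script or CI job is to treat low agreement as "needs human review" and only then compare the composite against a threshold. The sketch below uses a stand-in dataclass rather than the real result object; EvalSummary, triage, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Hypothetical stand-in holding the fields shown in the sample output."""
    composite_score: float
    score_scale: int
    juror_agreement: float


def triage(summary, min_normalized=0.7, min_agreement=0.6):
    """Classify a result as "pass", "fail", or "review"."""
    # Jurors disagree: the composite is contested, so defer to a human.
    if summary.juror_agreement < min_agreement:
        return "review"
    normalized = summary.composite_score / summary.score_scale
    return "pass" if normalized >= min_normalized else "fail"


summary = EvalSummary(composite_score=3.87, score_scale=5, juror_agreement=0.88)
print(triage(summary))
```

Checking agreement first reflects the reading above: a composite score is only worth comparing to a threshold when the jurors broadly agree on it.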

Key Features

  • Agent Evaluation — score a single agent response per prompt
  • Structured Rubrics — score anchors per criterion improve inter-juror reliability
  • Eight Canned Metrics — weighted mean, median, harmonic mean, weakest link, juror agreement, and more
  • Custom Scoring — register a Python function for domain-specific composite logic
  • Consistency Audit — num_trials > 1 measures response reliability
  • Batch Evaluation — inline config datasets, JSONL/CSV files, or evaluate_items()
  • Parallel Processing — concurrent jurors and batch items
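As an illustration of what domain-specific composite logic can look like, here is a sketch of a safety-gate scorer written as a plain function. It does not call the registration API, whose signature is not shown in this README; the function name, the "safety" criterion, and the floor value are all assumptions.

```python
def safety_gated_score(criterion_means, weights, gate="safety", floor=2.0):
    """Weighted mean of per-criterion scores, capped by a gate criterion.

    If the gate criterion scores below floor, the composite cannot
    exceed the gate score, however good the other criteria look.
    """
    total_weight = sum(weights[name] for name in criterion_means)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    base = sum(weights[name] * score for name, score in criterion_means.items())
    base /= total_weight

    gate_score = criterion_means.get(gate)
    if gate_score is not None and gate_score < floor:
        return min(base, gate_score)
    return base


weights = {"helpfulness": 2.0, "accuracy": 2.0, "safety": 1.0}
unsafe = safety_gated_score(
    {"helpfulness": 5.0, "accuracy": 4.0, "safety": 1.0}, weights
)
safe = safety_gated_score(
    {"helpfulness": 5.0, "accuracy": 4.0, "safety": 4.0}, weights
)
print(unsafe, safe)
```

The gate addresses the case where a plain weighted mean would hide a single unacceptable score behind strong results on the other criteria.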

Examples

| Example | What it shows |
| --- | --- |
| examples/hello_world/ | Offline demo: no agent, no API keys |
| examples/basic_usage/ | Single prompt, full pipeline, reading AgentEvalResult |
| examples/provider_configs/ | OpenAI, OpenRouter, Ollama, mixed providers |
| examples/batch_eval/ | Inline/JSONL/CSV datasets, batch-eval CLI |
| examples/custom_scoring/ | ScoreAggregator.register(), safety-gate pattern |
| examples/consistency_audit/ | num_trials > 1, ConsistencyResult.score_std |
| examples/resume_evaluation/ | Fetch/score split, crash recovery |
| examples/web_server/ | Flask API wrapping evaluation |
| examples/tools/ | Mock agent for local development |

Full index: examples/README.md
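For the consistency audit, the quantity of interest is how much the score moves between repeated trials of the same prompt. A small sketch of that calculation follows; score_spread is a hypothetical helper, and whether the library's score_std uses the sample or population standard deviation is not stated here, so the sample form is an assumption.

```python
import statistics


def score_spread(trial_scores):
    """Return (mean, sample standard deviation) of per-trial composite scores."""
    if len(trial_scores) < 2:
        raise ValueError("need at least two trials to measure spread")
    return statistics.mean(trial_scores), statistics.stdev(trial_scores)


# Made-up composite scores from three trials of the same prompt.
mean_score, std_score = score_spread([3.9, 4.1, 3.7])
print(f"mean={mean_score:.2f} std={std_score:.2f}")
```

A small spread suggests the agent answers the prompt consistently; a large one means a single-trial score should be read with caution.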


Troubleshooting

| Symptom | Fix |
| --- | --- |
| ConfigurationError for ${VAR} | Export env vars before OpenJury(...). See provider-config.md |
| Partial juror override ValidationError | Set all three: model_name, api_key, provider. See config-schema.md |
| JurorException: missing criterion | Juror JSON keys must match criteria[].name exactly |
| EndpointFetchError | Check URL, headers, response_path. See endpoint-config.md |
| Low juror_agreement | Add rubrics, lower juror temperature. See recipes/design-rubrics.md |
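When debugging the missing-criterion case, it can help to compare a juror's raw reply against the configured criteria names by hand. The sketch below assumes the reply is a flat JSON object mapping criterion name to score, which may not match the actual juror output format; check_juror_keys is a hypothetical helper.

```python
import json


def check_juror_keys(juror_json, criteria_names):
    """Return (missing, unexpected) criterion names in a juror's JSON reply.

    The comparison is exact and case-sensitive, mirroring the rule that
    keys must match criteria[].name exactly.
    """
    returned = set(json.loads(juror_json))
    expected = set(criteria_names)
    return sorted(expected - returned), sorted(returned - expected)


missing, unexpected = check_juror_keys(
    '{"helpfulness": 4, "Accuracy": 5}',
    ["helpfulness", "accuracy"],
)
print("missing:", missing)
print("unexpected:", unexpected)
```

Here the capitalised "Accuracy" key shows up as unexpected while "accuracy" is reported missing, which is the typical shape of this error.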

Documentation

| Resource | Description |
| --- | --- |
| docs/ | Architecture, config schema, API, CLI |
| recipes/ | Task-oriented cookbook |
| notebooks/ | Interactive tutorials |
| CONTRIBUTING.md | Development setup |

Advanced topics (moved from this README for brevity):


Use Cases

  • Customer support agents — score helpfulness, accuracy, and tone per response
  • Code review assistants — evaluate correctness, readability, and security
  • Content generation — assess clarity, tone, and factuality before publishing
  • Production monitoring — track composite_score drift between model versions
  • Consistency testing — run num_trials=3 before shipping a prompt change

License

Apache License 2.0. See LICENSE.


Contributing

Contributions welcome! See CONTRIBUTING.md.
