Python SDK for evaluating multiple model outputs using configurable LLM-based jurors

OpenJury 🏛️

A Python SDK for evaluating your agent's response quality using a configurable panel of LLM judges.

Python 3.11+ License: Apache 2.0


Overview

OpenJury is an agent evaluation framework. Point it at your agent's HTTP endpoint and it will:

  1. Send a prompt to your agent and collect the response
  2. Pass the response to a panel of LLM judges (jurors), each scoring it against your criteria
  3. Return a composite quality score with a full statistical breakdown

The primary output is a single composite_score: a weighted mean of all juror scores across all criteria. Alongside it, the result reports further canned metrics (median, harmonic mean, weakest link, juror agreement, and more). You can also register a custom scoring function for domain-specific logic.
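As a rough illustration of what a weighted mean over jurors and criteria can look like, here is a minimal sketch in plain Python. It does not call OpenJury. The score values are made up, and the assumption that each score is weighted by juror weight times criterion weight is ours, not something this README confirms.

```python
from statistics import median

# Hypothetical juror scores on a 1-5 scale: scores[juror][criterion].
scores = {
    "Support Expert": {"helpfulness": 4, "accuracy": 5},
    "Customer Perspective": {"helpfulness": 3, "accuracy": 4},
}
juror_weights = {"Support Expert": 2.0, "Customer Perspective": 1.0}
criterion_weights = {"helpfulness": 2.0, "accuracy": 2.0}


def weighted_mean(scores, juror_weights, criterion_weights):
    """Weighted mean; each score's weight is juror weight x criterion weight (assumed)."""
    total = 0.0
    weight_sum = 0.0
    for juror, per_criterion in scores.items():
        for criterion, score in per_criterion.items():
            weight = juror_weights[juror] * criterion_weights[criterion]
            total += weight * score
            weight_sum += weight
    return total / weight_sum


composite = weighted_mean(scores, juror_weights, criterion_weights)
all_scores = [s for per_criterion in scores.values() for s in per_criterion.values()]
print(f"weighted mean: {composite:.3f}, median: {median(all_scores)}")
```

With these inputs the heavier juror pulls the weighted mean above the plain median, which is the effect juror weights are meant to have.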

Why a jury instead of a single judge?

Relying on one LLM to evaluate outputs is common but fragile: it's expensive and prone to intra-model bias. Research from Cohere shows that a panel of smaller, diverse models produces more reliable and less biased evaluations at lower cost.

OpenJury makes this practical: configure jurors declaratively in JSON, wire rubrics per criterion for consistent scoring, and get a structured result you can act on.


Installation

Requirements: Python 3.11 or newer

pip install openjury

From source

git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]"     # optional dev dependencies

Choose your path

| Track | Goal | Time | API keys? |
|---|---|---|---|
| Try it | See output shape, understand AgentEvalResult | 2 min | No |
| Evaluate my agent | Full fetch + score pipeline | 10 min | Yes |
| Production integrate | Batch, resume, custom scoring, CI | 30+ min | Yes |

Try it (no agent, no keys)

pip install openjury
python examples/hello_world/score_existing.py

examples/hello_world/ · offline demo with sample output

Evaluate my agent

# Terminal 1 — mock agent (or use your own endpoint)
python examples/tools/mock_agent.py --port 8080

# Terminal 2
export OPENAI_API_KEY="..." AGENT_API_KEY=demo
python examples/basic_usage/basic_jury_run.py

examples/basic_usage/ · docs/endpoint-config.md

Go deeper

  • docs/ — architecture, config schema, composable API, CLI
  • recipes/ — task-oriented how-tos
  • notebooks/ — interactive Jupyter walkthroughs
  • examples/ — full examples index

Quick Start

1. Create a jury config

Set a jury-level llm_provider for shared credentials. Jurors inherit it by default. Use ${ENV_VAR} for secrets.

{
  "name": "Customer Support Jury",
  "score_scale": 5,
  "llm_provider": {
    "provider": "openai_compatible",
    "model_name": "gpt-4o-mini",
    "api_key": "${OPENAI_API_KEY}"
  },
  "jurors": [
    { "name": "Support Expert", "system_prompt": "You are a senior support manager.", "weight": 2.0 },
    { "name": "Customer Perspective", "weight": 1.0 }
  ],
  "criteria": [
    {
      "name": "helpfulness",
      "description": "Does the response resolve the customer's issue?",
      "weight": 2.0,
      "rubric": {
        "1": "Ignores or misunderstands the question",
        "3": "Partially addresses the question",
        "5": "Directly and completely resolves the issue"
      }
    },
    {
      "name": "accuracy",
      "description": "Is the information factually correct?",
      "weight": 2.0,
      "rubric": {
        "1": "Contains factual errors",
        "3": "Mostly accurate with minor gaps",
        "5": "Completely accurate"
      }
    }
  ]
}

Full field reference: docs/config-schema.md
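To make the ${ENV_VAR} convention concrete, the sketch below shows one way such placeholders could be expanded after loading a config. It is an illustration using only the standard library, not OpenJury's own loader; the helper name expand_env and the choice to raise KeyError on a missing variable are assumptions.

```python
import json
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value):
    """Recursively replace ${NAME} inside strings with os.environ[NAME]."""
    if isinstance(value, str):
        def _substitute(match):
            name = match.group(1)
            if name not in os.environ:
                raise KeyError(f"environment variable {name} is not set")
            return os.environ[name]
        return _ENV_PATTERN.sub(_substitute, value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


raw = '{"llm_provider": {"model_name": "gpt-4o-mini", "api_key": "${OPENAI_API_KEY}"}}'
os.environ.setdefault("OPENAI_API_KEY", "sk-demo")  # placeholder value for the demo only
config = expand_env(json.loads(raw))
print(config["llm_provider"]["model_name"])
```

Failing loudly on an unset variable mirrors the ConfigurationError symptom listed under Troubleshooting: the secret has to be exported before the config is loaded.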

2. Run an evaluation

from openjury import JuryConfig, OpenJury, ResultFormatter
from openjury.endpoint_fetcher import AgentEndpoint

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

endpoint = AgentEndpoint(
    url="http://localhost:8080/v1/chat/completions",
    alias="my-agent",
    headers={"Authorization": "Bearer ${AGENT_API_KEY}"},
    request_body_template={
        "model": "my-model",
        "messages": [{"role": "user", "content": "{prompt}"}],
    },
)

result = jury.evaluate(prompt="How do I reset my password?", endpoint=endpoint)
print(ResultFormatter.format_result(result))
print(f"Score: {result.composite_score:.2f} / {result.score_scale}")

score_response() is a backward-compatible alias for evaluate().
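The request_body_template above carries a {prompt} placeholder. How AgentEndpoint fills it in is not shown in this README, so the following is only a sketch of the idea; render_template is a hypothetical helper, not part of the SDK.

```python
import json


def render_template(template, prompt):
    """Return a copy of template with '{prompt}' replaced in every string value."""
    if isinstance(template, str):
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {key: render_template(item, prompt) for key, item in template.items()}
    if isinstance(template, list):
        return [render_template(item, prompt) for item in template]
    return template


template = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "{prompt}"}],
}
body = render_template(template, 'How do I reset my "admin" password?')
print(json.dumps(body, indent=2))
```

Substituting into the parsed structure, rather than into a serialized JSON string, keeps the payload valid even when the prompt contains quotes or braces.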

CLI:

openjury run \
  --config jury_config.json \
  --endpoints-config endpoints.json \
  --prompt "How do I reset my password?"

3. Read the output

╔══ Quality Evaluation  (scale: 1–5) ══
  composite_score:   3.87 / 5  (0.774 normalized)
  juror_agreement (0–1)        0.880   ← 1 = unanimous
  ...
  • composite_score — headline quality number (weighted_mean from trial 1)
  • juror_agreement — near 1.0 = high confidence; near 0 = contested
  • weakest_link — flags a standout failure even when composite looks okay
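One way to act on these numbers is a simple pass/fail gate, for example in CI. The sketch below works on plain floats rather than on an AgentEvalResult, and the thresholds are illustrative assumptions. The normalization follows the sample output, where 3.87 / 5 is reported as 0.774.

```python
def normalize(score, scale):
    """Divide by the scale maximum, as in the sample output (3.87 / 5 = 0.774)."""
    return score / scale


def gate(composite, scale, agreement, min_normalized=0.7, min_agreement=0.6):
    """Return (passed, reasons). Threshold defaults are illustrative, not SDK values."""
    reasons = []
    normalized = normalize(composite, scale)
    if normalized < min_normalized:
        reasons.append(f"normalized score {normalized:.3f} < {min_normalized}")
    if agreement < min_agreement:
        reasons.append(f"juror agreement {agreement:.3f} < {min_agreement}")
    return (not reasons, reasons)


passed, reasons = gate(composite=3.87, scale=5, agreement=0.880)
print("PASS" if passed else "FAIL", reasons)
```

Checking agreement separately from the score avoids trusting a decent composite that the jurors strongly disagreed on.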

Key Features

  • Agent Evaluation — score a single agent response per prompt
  • Structured Rubrics — score anchors per criterion improve inter-juror reliability
  • Eight Canned Metrics — weighted mean, median, harmonic mean, weakest link, juror agreement, and more
  • Custom Scoring — register a Python function for domain-specific composite logic
  • Consistency Audit — num_trials > 1 measures response reliability
  • Batch Evaluation — inline config datasets, JSONL/CSV files, or evaluate_items()
  • Parallel Processing — concurrent jurors and batch items
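To give a feel for the kind of domain-specific logic a custom scoring function might hold, here is a sketch of a safety-gate composite. It is a standalone function: the signature that ScoreAggregator.register() expects is not documented in this README, so the input shape (per-criterion mean scores plus weights), the criterion name "safety", and the threshold and cap values are all assumptions.

```python
def safety_gated_score(criterion_means, weights,
                       gate_criterion="safety", gate_threshold=2.0, cap=1.0):
    """Weighted mean of criterion scores, capped when the gate criterion scores too low."""
    total_weight = sum(weights[name] for name in criterion_means)
    base = sum(weights[name] * score for name, score in criterion_means.items()) / total_weight
    # A missing gate criterion is treated as passing the gate (an assumption).
    if criterion_means.get(gate_criterion, float("inf")) < gate_threshold:
        return min(base, cap)
    return base


weights = {"helpfulness": 1.0, "safety": 1.0}
print(safety_gated_score({"helpfulness": 5, "safety": 4}, weights))  # gate not triggered
print(safety_gated_score({"helpfulness": 5, "safety": 1}, weights))  # gate triggered, capped
```

The point of the pattern is that a single unsafe answer cannot be averaged away by high scores on other criteria.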

Examples

| Example | What it shows |
|---|---|
| examples/hello_world/ | Offline demo — no agent, no API keys |
| examples/basic_usage/ | Single prompt, full pipeline, reading AgentEvalResult |
| examples/provider_configs/ | OpenAI, OpenRouter, Ollama, mixed providers |
| examples/batch_eval/ | Inline/JSONL/CSV datasets, batch-eval CLI |
| examples/custom_scoring/ | ScoreAggregator.register(), safety-gate pattern |
| examples/consistency_audit/ | num_trials > 1, ConsistencyResult.score_std |
| examples/resume_evaluation/ | Fetch/score split, crash recovery |
| examples/web_server/ | Flask API wrapping evaluation |
| examples/tools/ | Mock agent for local development |

Full index: examples/README.md
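The consistency audit entry mentions a score_std over repeated trials. As a rough picture of what that statistic captures, the sketch below summarizes a list of hypothetical per-trial composite scores with the standard library. Whether OpenJury uses the population or the sample standard deviation is not stated here; the population form is used as an assumption.

```python
from statistics import mean, pstdev


def consistency_summary(trial_scores):
    """Summarize per-trial composite scores: mean, population std, and min-max spread."""
    return {
        "mean": mean(trial_scores),
        "std": pstdev(trial_scores),
        "spread": max(trial_scores) - min(trial_scores),
    }


# Hypothetical composite scores from three trials of the same prompt.
summary = consistency_summary([3.9, 3.7, 4.1])
print({name: round(value, 3) for name, value in summary.items()})
```

A small standard deviation relative to the score scale suggests the agent answers the prompt consistently; a large one is a signal to inspect individual trials.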


Troubleshooting

| Symptom | Fix |
|---|---|
| ConfigurationError for ${VAR} | Export env vars before OpenJury(...). See provider-config.md |
| Partial juror override ValidationError | Set all three: model_name, api_key, provider. See config-schema.md |
| JurorException: missing criterion | Juror JSON keys must match criteria[].name exactly |
| EndpointFetchError | Check URL, headers, response_path. See endpoint-config.md |
| Low juror_agreement | Add rubrics, lower juror temperature. See recipes/design-rubrics.md |
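When debugging a missing-criterion error, it can help to compare a juror's raw JSON reply against the configured criterion names. The helper below is a hypothetical debugging aid written for this purpose, not an OpenJury function; it assumes the juror reply is a flat JSON object keyed by criterion name.

```python
import json


def check_juror_keys(juror_json, criteria_names):
    """Return (missing, unexpected) criterion names found in a juror's JSON reply."""
    reply = json.loads(juror_json)
    expected = set(criteria_names)
    received = set(reply)
    return sorted(expected - received), sorted(received - expected)


# "Accuracy" differs from "accuracy" only by case, which still counts as a mismatch.
missing, unexpected = check_juror_keys(
    '{"helpfulness": 4, "Accuracy": 5}',
    ["helpfulness", "accuracy"],
)
print("missing:", missing, "unexpected:", unexpected)
```

Because the match is exact, differences in case or stray whitespace in a key show up as one missing and one unexpected name.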

Documentation

| Resource | Description |
|---|---|
| docs/ | Architecture, config schema, API, CLI |
| recipes/ | Task-oriented cookbook |
| notebooks/ | Interactive tutorials |
| CONTRIBUTING.md | Development setup |

Advanced topics (moved from this README for brevity):


Use Cases

  • Customer support agents — score helpfulness, accuracy, and tone per response
  • Code review assistants — evaluate correctness, readability, and security
  • Content generation — assess clarity, tone, and factuality before publishing
  • Production monitoring — track composite_score drift between model versions
  • Consistency testing — run num_trials=3 before shipping a prompt change

License

Apache License 2.0. See LICENSE.


Contributing

Contributions welcome! See CONTRIBUTING.md.
