Python SDK for evaluating multiple model outputs using configurable LLM-based jurors

OpenJury 🏛️

A Python SDK for evaluating your agent's response quality using a configurable panel of LLM judges.

Python 3.11+ License: Apache 2.0


Overview

OpenJury is an agent evaluation framework. Point it at your agent's HTTP endpoint and it will:

  1. Send a prompt to your agent and collect the response
  2. Pass the response to a panel of LLM judges (jurors), each scoring it against your criteria
  3. Return a composite quality score with a full statistical breakdown

The primary output is a single composite_score — a weighted mean of all juror scores across all criteria, plus eight additional canned metrics (median, harmonic mean, weakest link, juror agreement, and more). You can also register a custom scoring function for domain-specific logic.
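As a rough illustration of what these aggregates mean, the sketch below computes a weighted mean, median, harmonic mean, and a minimum over a small score matrix. The data, the `weighted_mean` helper, the juror-weight times criterion-weight scheme, and reading "weakest link" as the minimum score are all assumptions for illustration, not OpenJury's actual implementation.

```python
from statistics import harmonic_mean, median

# Hypothetical data: scores[juror][criterion] on a 1-5 scale.
scores = {
    "Support Expert": {"helpfulness": 4, "accuracy": 5},
    "Customer Perspective": {"helpfulness": 3, "accuracy": 4},
}
juror_weights = {"Support Expert": 2.0, "Customer Perspective": 1.0}
criterion_weights = {"helpfulness": 2.0, "accuracy": 2.0}


def weighted_mean(scores, juror_weights, criterion_weights):
    """Weight each score by juror weight x criterion weight (assumed scheme)."""
    total = 0.0
    weight_sum = 0.0
    for juror, per_criterion in scores.items():
        for criterion, score in per_criterion.items():
            w = juror_weights[juror] * criterion_weights[criterion]
            total += w * score
            weight_sum += w
    return total / weight_sum


# Flatten every juror/criterion score for the unweighted statistics.
flat = [s for per_criterion in scores.values() for s in per_criterion.values()]
composite = weighted_mean(scores, juror_weights, criterion_weights)

print(f"weighted mean: {composite:.2f}")
print(f"median:        {median(flat):.2f}")
print(f"harmonic mean: {harmonic_mean(flat):.2f}")
print(f"weakest link:  {min(flat)}")  # assumed here to be the lowest single score
```

The harmonic mean and the minimum both pull the result toward low scores, which is why they are useful next to a weighted mean that can hide a single bad criterion.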

Why a jury instead of a single judge?

Relying on one LLM to evaluate outputs is common but fragile: it's expensive and prone to intra-model bias. Research from Cohere shows that a panel of smaller, diverse models produces more reliable and less biased evaluations at lower cost.

OpenJury makes this practical: configure jurors declaratively in JSON, wire rubrics per criterion for consistent scoring, and get a structured result you can act on.


Installation

Requirements: Python 3.11 or newer

pip install openjury

From source

git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]"     # optional dev dependencies

Choose your path

| Track | Goal | Time | API keys? |
|---|---|---|---|
| Try it | See output shape, understand AgentEvalResult | 2 min | No |
| Evaluate my agent | Full fetch + score pipeline | 10 min | Yes |
| Production integrate | Batch, resume, custom scoring, CI | 30+ min | Yes |

Try it (no agent, no keys)

pip install openjury
python examples/hello_world/score_existing.py

examples/hello_world/ · offline demo with sample output

Evaluate my agent

# Terminal 1 — mock agent (or use your own endpoint)
python examples/tools/mock_agent.py --port 8080

# Terminal 2
export OPENAI_API_KEY="..." AGENT_API_KEY=demo
python examples/basic_usage/basic_jury_run.py

examples/basic_usage/ · docs/endpoint-config.md

Go deeper

  • docs/ — architecture, config schema, composable API, CLI
  • recipes/ — task-oriented how-tos
  • notebooks/ — interactive Jupyter walkthroughs
  • examples/ — full examples index

Quick Start

1. Create a jury config

Set a jury-level llm_provider for shared credentials. Jurors inherit it by default. Use ${ENV_VAR} for secrets.

{
  "name": "Customer Support Jury",
  "score_scale": 5,
  "llm_provider": {
    "provider": "openai_compatible",
    "model_name": "gpt-4o-mini",
    "api_key": "${OPENAI_API_KEY}"
  },
  "jurors": [
    { "name": "Support Expert", "system_prompt": "You are a senior support manager.", "weight": 2.0 },
    { "name": "Customer Perspective", "weight": 1.0 }
  ],
  "criteria": [
    {
      "name": "helpfulness",
      "description": "Does the response resolve the customer's issue?",
      "weight": 2.0,
      "rubric": {
        "1": "Ignores or misunderstands the question",
        "3": "Partially addresses the question",
        "5": "Directly and completely resolves the issue"
      }
    },
    {
      "name": "accuracy",
      "description": "Is the information factually correct?",
      "weight": 2.0,
      "rubric": {
        "1": "Contains factual errors",
        "3": "Mostly accurate with minor gaps",
        "5": "Completely accurate"
      }
    }
  ]
}

Full field reference: docs/config-schema.md
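To make the `${ENV_VAR}` convention concrete, here is a minimal sketch of how such placeholders could be resolved before a config is used. The `expand_env` helper is hypothetical and is not part of OpenJury; the library's own resolution logic and error type may differ.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${NAME} in strings; raise KeyError if NAME is unset."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        def _sub(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_PATTERN.sub(_sub, value)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value  # numbers, booleans, None pass through unchanged


config = {"llm_provider": {"model_name": "gpt-4o-mini", "api_key": "${OPENAI_API_KEY}"}}
resolved = expand_env(config, env={"OPENAI_API_KEY": "sk-example"})
print(resolved["llm_provider"]["api_key"])
```

Failing loudly on an unset variable, rather than substituting an empty string, keeps a missing key from surfacing later as an opaque authentication error.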

2. Run an evaluation

from openjury import JuryConfig, OpenJury, ResultFormatter
from openjury.endpoint_fetcher import AgentEndpoint

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

endpoint = AgentEndpoint(
    url="http://localhost:8080/v1/chat/completions",
    alias="my-agent",
    headers={"Authorization": "Bearer ${AGENT_API_KEY}"},
    request_body_template={
        "model": "my-model",
        "messages": [{"role": "user", "content": "{prompt}"}],
    },
)

result = jury.evaluate(prompt="How do I reset my password?", endpoint=endpoint)
print(ResultFormatter.format_result(result))
print(f"Score: {result.composite_score:.2f} / {result.score_scale}")

score_response() is a backward-compatible alias for evaluate().
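The `request_body_template` above carries a `{prompt}` placeholder. The sketch below shows one way such a template could be filled before the request is sent. `fill_template` is a hypothetical helper, and plain string replacement is an assumption chosen so that other braces in the JSON body are left alone; OpenJury may substitute differently.

```python
import json


def fill_template(template, prompt):
    """Return a copy of the template with every "{prompt}" replaced."""
    if isinstance(template, str):
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {k: fill_template(v, prompt) for k, v in template.items()}
    if isinstance(template, list):
        return [fill_template(v, prompt) for v in template]
    return template


body = fill_template(
    {"model": "my-model", "messages": [{"role": "user", "content": "{prompt}"}]},
    "How do I reset my password?",
)
print(json.dumps(body, indent=2))
```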

CLI:

openjury run \
  --config jury_config.json \
  --endpoints-config endpoints.json \
  --prompt "How do I reset my password?"

3. Read the output

╔══ Quality Evaluation  (scale: 1–5) ══
  composite_score:   3.87 / 5  (0.774 normalized)
  juror_agreement (0–1)        0.880   ← 1 = unanimous
  ...

Key fields:
  • composite_score — headline quality number (weighted_mean from trial 1)
  • juror_agreement — near 1.0 = high confidence; near 0 = contested
  • weakest_link — flags a standout failure even when composite looks okay
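The normalized value in the sample output is the score divided by the scale (3.87 / 5 = 0.774). The sketch below reproduces that arithmetic and adds one possible agreement measure. The `agreement` formula (one minus the population standard deviation over the largest possible deviation) is purely an assumption to show how a 0 to 1 agreement number can behave; it is not OpenJury's definition.

```python
from statistics import pstdev


def normalize(score, scale):
    """Map a score on a 1..scale range to score / scale, as in the sample output."""
    return score / scale


def agreement(juror_scores, low=1, high=5):
    """Assumed measure: 1.0 when unanimous, 0.0 when split between both extremes."""
    max_spread = (high - low) / 2  # pstdev when half score `low` and half `high`
    return 1.0 - pstdev(juror_scores) / max_spread


print(round(normalize(3.87, 5), 3))
print(agreement([4, 4, 4]))  # unanimous jurors
print(agreement([1, 5]))     # maximally contested
```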

Key Features

  • Agent Evaluation — score a single agent response per prompt
  • Structured Rubrics — score anchors per criterion improve inter-juror reliability
  • Eight Canned Metrics — weighted mean, median, harmonic mean, weakest link, juror agreement, and more
  • Custom Scoring — register a Python function for domain-specific composite logic
  • Consistency Audit — num_trials > 1 measures response reliability
  • Batch Evaluation — JSONL/CSV datasets via CLI or evaluate_items()
  • Parallel Processing — concurrent jurors and batch items
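A custom scoring function can encode rules that a plain average cannot. The sketch below shows a safety-gate style rule as a standalone function: if a gate criterion falls below a floor, the composite collapses to that score. The function name, the `safety` criterion, and the argument shapes are hypothetical; the actual signature expected when registering a scorer with OpenJury is documented in examples/custom_scoring/ and is not reproduced here.

```python
def safety_gated_score(criterion_scores, weights, gate="safety", floor=2.0):
    """Weighted mean of criterion scores, unless the gate criterion is below floor."""
    gate_score = criterion_scores.get(gate)
    if gate_score is not None and gate_score < floor:
        # A failing gate overrides everything else.
        return float(gate_score)
    total = sum(weights[name] * score for name, score in criterion_scores.items())
    return total / sum(weights[name] for name in criterion_scores)


weights = {"helpfulness": 1.0, "accuracy": 1.0, "safety": 1.0}
unsafe = safety_gated_score({"helpfulness": 5, "accuracy": 4, "safety": 1}, weights)
safe = safety_gated_score({"helpfulness": 5, "accuracy": 4, "safety": 4}, weights)
print(unsafe, round(safe, 2))
```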

Examples

| Example | What it shows |
|---|---|
| examples/hello_world/ | Offline demo (no agent, no API keys) |
| examples/basic_usage/ | Single prompt, full pipeline, reading AgentEvalResult |
| examples/provider_configs/ | OpenAI, OpenRouter, Ollama, mixed providers |
| examples/batch_eval/ | JSONL/CSV dataset, batch-eval CLI |
| examples/custom_scoring/ | ScoreAggregator.register(), safety-gate pattern |
| examples/consistency_audit/ | num_trials > 1, ConsistencyResult.score_std |
| examples/resume_evaluation/ | Fetch/score split, crash recovery |
| examples/web_server/ | Flask API wrapping evaluation |
| examples/tools/ | Mock agent for local development |

Full index: examples/README.md
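For the consistency audit, the idea behind a per-prompt score spread can be sketched with the standard library. The trial scores and the 0.25 threshold below are made up, and whether `ConsistencyResult.score_std` uses the population or the sample standard deviation is not stated here, so treat the choice of `pstdev` as an assumption.

```python
from statistics import mean, pstdev

# Hypothetical composite scores from three trials of the same prompt.
trial_scores = [3.9, 4.1, 3.8]

score_mean = mean(trial_scores)
score_std = pstdev(trial_scores)  # population standard deviation (assumed)
is_stable = score_std < 0.25      # illustrative threshold, tune per use case

print(f"mean={score_mean:.2f} std={score_std:.3f} stable={is_stable}")
```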


Troubleshooting

| Symptom | Fix |
|---|---|
| ConfigurationError for ${VAR} | Export env vars before OpenJury(...). See provider-config.md |
| Partial juror override ValidationError | Set all three: model_name, api_key, provider. See config-schema.md |
| JurorException: missing criterion | Juror JSON keys must match criteria[].name exactly |
| EndpointFetchError | Check URL, headers, response_path. See endpoint-config.md |
| Low juror_agreement | Add rubrics, lower juror temperature. See recipes/design-rubrics.md |
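When debugging a missing-criterion error, it can help to compare a juror's raw JSON reply against the configured criterion names. The sketch below assumes the reply is a flat JSON object keyed by criterion name; the actual response shape OpenJury expects may be richer, and `missing_criteria` is a hypothetical helper.

```python
import json


def missing_criteria(juror_json, criteria_names):
    """Return the criterion names absent from the juror's JSON reply (case-sensitive)."""
    parsed = json.loads(juror_json)
    return sorted(set(criteria_names) - set(parsed))


# "Accuracy" does not match "accuracy": keys are compared exactly.
reply = '{"helpfulness": 4, "Accuracy": 5}'
print(missing_criteria(reply, ["helpfulness", "accuracy"]))
```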

Documentation

| Resource | Description |
|---|---|
| docs/ | Architecture, config schema, API, CLI |
| recipes/ | Task-oriented cookbook |
| notebooks/ | Interactive tutorials |
| CONTRIBUTING.md | Development setup |

Advanced topics (moved from this README for brevity):


Use Cases

  • Customer support agents — score helpfulness, accuracy, and tone per response
  • Code review assistants — evaluate correctness, readability, and security
  • Content generation — assess clarity, tone, and factuality before publishing
  • Production monitoring — track composite_score drift between model versions
  • Consistency testing — run num_trials=3 before shipping a prompt change
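For the production monitoring case, drift between two model versions can be checked with a simple comparison of mean composite scores over the same prompt set. The scores, the `score_drift` helper, and the 0.25 tolerance are hypothetical; a real CI gate would read the scores from stored evaluation results.

```python
from statistics import mean


def score_drift(baseline, candidate):
    """Mean composite score of the candidate minus that of the baseline."""
    return mean(candidate) - mean(baseline)


# Hypothetical composite scores for the same prompts under two model versions.
baseline = [4.0, 3.8, 4.2]
candidate = [3.5, 3.6, 3.7]

drift = score_drift(baseline, candidate)
regressed = drift < -0.25  # illustrative tolerance for a CI check
print(f"drift={drift:+.2f} regressed={regressed}")
```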

License

Apache License 2.0. See LICENSE.


Contributing

Contributions welcome! See CONTRIBUTING.md.
