Python SDK for evaluating multiple model outputs using configurable LLM-based jurors

OpenJury 🏛️

A Python SDK for evaluating your agent's response quality using a configurable panel of LLM judges.

Overview

OpenJury is an agent evaluation framework. Point it at your agent's HTTP endpoint and it will:

Send a prompt to your agent and collect the response
Pass the response to a panel of LLM judges (jurors), each scoring it against your criteria
Return a composite quality score with a full statistical breakdown

The primary output is a single composite_score: a weighted mean of all juror scores across all criteria. It is reported alongside the other canned metrics (median, harmonic mean, weakest link, juror agreement, and more). You can also register a custom scoring function for domain-specific logic.
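
As a rough illustration of how these metrics relate, the sketch below computes a weighted mean, median, harmonic mean, and weakest link from a small table of juror scores. The weighting scheme (juror weight multiplied by criterion weight), the sample scores, and every name here are assumptions made for illustration; OpenJury's exact formulas live in its source and docs.

```python
from statistics import harmonic_mean, median

# Hypothetical scores on a 1-5 scale, laid out as scores[juror][criterion].
scores = {
    "Support Expert": {"helpfulness": 4, "accuracy": 5},
    "Customer Perspective": {"helpfulness": 3, "accuracy": 4},
}
juror_weights = {"Support Expert": 2.0, "Customer Perspective": 1.0}
criterion_weights = {"helpfulness": 2.0, "accuracy": 2.0}


def weighted_mean(scores, juror_weights, criterion_weights):
    """Weight each score by juror weight x criterion weight (an assumed scheme)."""
    total = 0.0
    weight_sum = 0.0
    for juror, per_criterion in scores.items():
        for criterion, score in per_criterion.items():
            w = juror_weights[juror] * criterion_weights[criterion]
            total += w * score
            weight_sum += w
    return total / weight_sum


# Flatten to a plain list for the unweighted metrics.
flat = [s for per_criterion in scores.values() for s in per_criterion.values()]
composite = weighted_mean(scores, juror_weights, criterion_weights)

print(f"weighted mean: {composite:.2f}")
print(f"median:        {median(flat):.2f}")
print(f"harmonic mean: {harmonic_mean(flat):.2f}")
print(f"weakest link:  {min(flat)}")
```

The harmonic mean and weakest link are both pulled down by a single low score, which is why they are useful companions to a mean that can hide one bad criterion.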

Why a jury instead of a single judge?

Relying on one LLM to evaluate outputs is common but fragile: it's expensive and prone to intra-model bias. Research from Cohere shows that a panel of smaller, diverse models produces more reliable and less biased evaluations at lower cost.

OpenJury makes this practical: configure jurors declaratively in JSON, wire rubrics per criterion for consistent scoring, and get a structured result you can act on.

Installation

Requirements: Python 3.11 or newer

pip install openjury

From source

git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]"     # optional dev dependencies

Choose your path

Track	Goal	Time	API keys?
Try it	See output shape, understand `AgentEvalResult`	2 min	No
Evaluate my agent	Full fetch + score pipeline	10 min	Yes
Production integrate	Batch, resume, custom scoring, CI	30+ min	Yes

Try it (no agent, no keys)

pip install openjury
python examples/hello_world/score_existing.py

→ examples/hello_world/ · offline demo with sample output

Evaluate my agent

# Terminal 1 — mock agent (or use your own endpoint)
python examples/tools/mock_agent.py --port 8080

# Terminal 2
export OPENAI_API_KEY="..." AGENT_API_KEY=demo
python examples/basic_usage/basic_jury_run.py

→ examples/basic_usage/ · docs/endpoint-config.md

Go deeper

docs/ — architecture, config schema, composable API, CLI
recipes/ — task-oriented how-tos
notebooks/ — interactive Jupyter walkthroughs
examples/ — full examples index

Quick Start

1. Create a jury config

Set a jury-level llm_provider for shared credentials. Jurors inherit it by default. Use ${ENV_VAR} for secrets.

{
  "name": "Customer Support Jury",
  "score_scale": 5,
  "llm_provider": {
    "provider": "openai_compatible",
    "model_name": "gpt-4o-mini",
    "api_key": "${OPENAI_API_KEY}"
  },
  "jurors": [
    { "name": "Support Expert", "system_prompt": "You are a senior support manager.", "weight": 2.0 },
    { "name": "Customer Perspective", "weight": 1.0 }
  ],
  "criteria": [
    {
      "name": "helpfulness",
      "description": "Does the response resolve the customer's issue?",
      "weight": 2.0,
      "rubric": {
        "1": "Ignores or misunderstands the question",
        "3": "Partially addresses the question",
        "5": "Directly and completely resolves the issue"
      }
    },
    {
      "name": "accuracy",
      "description": "Is the information factually correct?",
      "weight": 2.0,
      "rubric": {
        "1": "Contains factual errors",
        "3": "Mostly accurate with minor gaps",
        "5": "Completely accurate"
      }
    }
  ]
}

Full field reference: docs/config-schema.md
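
To make the two config behaviours above concrete, here is a minimal sketch of ${ENV_VAR} expansion and of a juror falling back to the jury-level llm_provider. It is not OpenJury's loader: `expand_env` and `resolve_juror_provider` are hypothetical helpers, and the assumption that a juror override lives under its own `llm_provider` key is mine.

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${VAR} in string values; fail loudly if VAR is unset."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        def _sub(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _ENV_PATTERN.sub(_sub, value)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value  # numbers, booleans, None pass through unchanged


def resolve_juror_provider(config, juror):
    """Use the juror's own llm_provider if present, else the jury-level one."""
    return juror.get("llm_provider") or config["llm_provider"]


config = {
    "llm_provider": {
        "provider": "openai_compatible",
        "model_name": "gpt-4o-mini",
        "api_key": "${OPENAI_API_KEY}",
    },
    "jurors": [{"name": "Support Expert", "weight": 2.0}],
}

# A fake environment keeps the example independent of real credentials.
resolved = expand_env(config, env={"OPENAI_API_KEY": "sk-test"})
provider = resolve_juror_provider(resolved, resolved["jurors"][0])
print(provider["model_name"], provider["api_key"])
```

Failing on an unset variable, rather than substituting an empty string, surfaces a missing secret at load time instead of as an authentication error later.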

2. Run an evaluation

from openjury import JuryConfig, OpenJury, ResultFormatter
from openjury.endpoint_fetcher import AgentEndpoint

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))

endpoint = AgentEndpoint(
    url="http://localhost:8080/v1/chat/completions",
    alias="my-agent",
    headers={"Authorization": "Bearer ${AGENT_API_KEY}"},
    request_body_template={
        "model": "my-model",
        "messages": [{"role": "user", "content": "{prompt}"}],
    },
)

result = jury.evaluate(prompt="How do I reset my password?", endpoint=endpoint)
print(ResultFormatter.format_result(result))
print(f"Score: {result.composite_score:.2f} / {result.score_scale}")

score_response() is a backward-compatible alias for evaluate().
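
The "{prompt}" placeholder in request_body_template can be pictured with the sketch below, which fills the placeholder in every string of a nested template. `render_template` is a hypothetical helper written for this illustration, not part of the OpenJury API, and the library may substitute differently.

```python
import json


def render_template(template, prompt):
    """Return a copy of template with "{prompt}" replaced in every string value."""
    if isinstance(template, str):
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {k: render_template(v, prompt) for k, v in template.items()}
    if isinstance(template, list):
        return [render_template(v, prompt) for v in template]
    return template  # non-string leaves are copied as-is


request_body_template = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "{prompt}"}],
}

body = render_template(request_body_template, 'How do I reset my "admin" password?')
print(json.dumps(body, indent=2))
```

Substituting into the parsed structure, rather than into a JSON string, means quotes or newlines in the prompt cannot break the payload; `json.dumps` handles the escaping.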

CLI:

openjury run \
  --config jury_config.json \
  --endpoints-config endpoints.json \
  --prompt "How do I reset my password?"

3. Read the output

╔══ Quality Evaluation  (scale: 1–5) ══
  composite_score:   3.87 / 5  (0.774 normalized)
  juror_agreement (0–1)        0.880   ← 1 = unanimous
  ...

composite_score — headline quality number (weighted_mean from trial 1)
juror_agreement — near 1.0 = high confidence; near 0 = contested
weakest_link — flags a standout failure even when composite looks okay
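
One way to act on these three numbers is a small triage step, sketched below. `EvalSummary` is a hypothetical stand-in holding just the fields discussed here (it is not `AgentEvalResult`), and the thresholds and the sample weakest_link values are arbitrary assumptions to tune per project.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Hypothetical stand-in for the fields read from an evaluation result."""
    composite_score: float
    score_scale: int
    juror_agreement: float
    weakest_link: float


def triage(summary, min_normalized=0.7, min_agreement=0.6, min_weakest=2.0):
    """Return human-readable warnings; an empty list means the response passes."""
    warnings = []
    normalized = summary.composite_score / summary.score_scale
    if normalized < min_normalized:
        warnings.append(f"low composite: {normalized:.3f} normalized")
    if summary.juror_agreement < min_agreement:
        warnings.append(f"contested: juror_agreement={summary.juror_agreement:.2f}")
    if summary.weakest_link < min_weakest:
        warnings.append(f"standout failure: weakest_link={summary.weakest_link}")
    return warnings


# Composite and agreement mirror the sample output above; weakest_link is made up.
ok = EvalSummary(composite_score=3.87, score_scale=5, juror_agreement=0.88, weakest_link=3.0)
bad = EvalSummary(composite_score=3.87, score_scale=5, juror_agreement=0.88, weakest_link=1.0)

print(triage(ok))
print(triage(bad))
```

The second case shows why weakest_link is worth checking separately: the composite and agreement are identical, yet one juror or criterion scored the response at the bottom of the scale.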

Key Features

Agent Evaluation — score a single agent response per prompt
Structured Rubrics — score anchors per criterion improve inter-juror reliability
Eight Canned Metrics — weighted mean, median, harmonic mean, weakest link, juror agreement, and more
Custom Scoring — register a Python function for domain-specific composite logic
Consistency Audit — num_trials > 1 measures response reliability
Batch Evaluation — JSONL/CSV datasets via CLI or evaluate_items()
Parallel Processing — concurrent jurors and batch items
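
As an example of the custom scoring idea, the sketch below shows a safety-gate composite: a weighted mean that is capped whenever one gating criterion falls below a floor. It is a standalone function; the input shape, the "safety" criterion, and the threshold values are assumptions, and the signature OpenJury expects from a registered function is documented in examples/custom_scoring/ rather than here.

```python
def safety_gated_score(criterion_means, criterion_weights, gate="safety", floor=3.0, cap=1.0):
    """Weighted mean of per-criterion means, capped when the gate criterion is below floor."""
    weight_sum = sum(criterion_weights[name] for name in criterion_means)
    base = sum(
        criterion_means[name] * criterion_weights[name] for name in criterion_means
    ) / weight_sum
    # A failing gate overrides an otherwise good average.
    if gate in criterion_means and criterion_means[gate] < floor:
        return min(base, cap)
    return base


weights = {"helpfulness": 1.0, "accuracy": 1.0, "safety": 1.0}
safe = safety_gated_score({"helpfulness": 4.5, "accuracy": 4.0, "safety": 4.0}, weights)
unsafe = safety_gated_score({"helpfulness": 4.5, "accuracy": 4.0, "safety": 2.0}, weights)

print(f"safe response:   {safe:.2f}")
print(f"unsafe response: {unsafe:.2f}")
```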

Examples

Example	What it shows
`examples/hello_world/`	Offline demo — no agent, no API keys
`examples/basic_usage/`	Single prompt, full pipeline, reading `AgentEvalResult`
`examples/provider_configs/`	OpenAI, OpenRouter, Ollama, mixed providers
`examples/batch_eval/`	JSONL/CSV dataset, `batch-eval` CLI
`examples/custom_scoring/`	`ScoreAggregator.register()`, safety-gate pattern
`examples/consistency_audit/`	`num_trials > 1`, `ConsistencyResult.score_std`
`examples/resume_evaluation/`	Fetch/score split, crash recovery
`examples/web_server/`	Flask API wrapping evaluation
`examples/tools/`	Mock agent for local development

Full index: examples/README.md

Troubleshooting

Symptom	Fix
`ConfigurationError` for `${VAR}`	Export env vars before `OpenJury(...)`. See provider-config.md
Partial juror override `ValidationError`	Set all three: `model_name`, `api_key`, `provider`. See config-schema.md
`JurorException: missing criterion`	Juror JSON keys must match `criteria[].name` exactly
`EndpointFetchError`	Check URL, headers, `response_path`. See endpoint-config.md
Low `juror_agreement`	Add rubrics, lower juror temperature. See recipes/design-rubrics.md
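
The "missing criterion" row is easiest to debug by comparing a juror's raw reply against the configured criterion names. The sketch below assumes the reply is a flat JSON object mapping criterion name to score on a 1 to score_scale range; that shape and the `check_juror_scores` helper are illustrative assumptions, not OpenJury's parser.

```python
import json


def check_juror_scores(raw_reply, criteria_names, score_scale=5):
    """Return a list of problems in one juror reply; an empty list means it is valid."""
    scores = json.loads(raw_reply)
    problems = []
    for name in criteria_names:
        if name not in scores:
            problems.append(f"missing criterion: {name!r}")
        elif not isinstance(scores[name], (int, float)) or not 1 <= scores[name] <= score_scale:
            problems.append(f"score out of range for {name!r}: {scores[name]!r}")
    # Keys are compared exactly, so a case mismatch shows up as missing + unexpected.
    for key in scores:
        if key not in criteria_names:
            problems.append(f"unexpected key: {key!r}")
    return problems


criteria = ["helpfulness", "accuracy"]
print(check_juror_scores('{"helpfulness": 4, "accuracy": 5}', criteria))
print(check_juror_scores('{"Helpfulness": 4, "accuracy": 5}', criteria))
```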

Documentation

Resource	Description
docs/	Architecture, config schema, API, CLI
recipes/	Task-oriented cookbook
notebooks/	Interactive tutorials
CONTRIBUTING.md	Development setup

Advanced topics (moved from this README for brevity):

Composable API — fetch/score split, batch, serialization
Batch evaluation — JSONL datasets
Consistency audit — num_trials
Custom scoring — safety gates
Provider setup — OpenAI, OpenRouter, Anthropic, Ollama

Use Cases

Customer support agents — score helpfulness, accuracy, and tone per response
Code review assistants — evaluate correctness, readability, and security
Content generation — assess clarity, tone, and factuality before publishing
Production monitoring — track composite_score drift between model versions
Consistency testing — run num_trials=3 before shipping a prompt change
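
For the consistency-testing case, the sketch below summarizes composite scores from repeated trials and flags a prompt as unstable when their spread is too large. The use of population standard deviation, the 0.5 threshold, and the sample scores are assumptions; how `ConsistencyResult.score_std` is computed is defined by the library, not by this example.

```python
from statistics import mean, pstdev


def consistency_summary(trial_scores, max_std=0.5):
    """Summarize composite scores from repeated trials of the same prompt."""
    if len(trial_scores) < 2:
        raise ValueError("need at least two trials to measure consistency")
    std = pstdev(trial_scores)
    return {
        "mean": mean(trial_scores),
        "std": std,
        "spread": max(trial_scores) - min(trial_scores),
        "stable": std <= max_std,
    }


# Hypothetical composite scores from num_trials=3 runs on a 1-5 scale.
steady = consistency_summary([3.9, 4.1, 3.8])
erratic = consistency_summary([2.0, 4.5, 3.0])

print(f"steady:  std={steady['std']:.3f} stable={steady['stable']}")
print(f"erratic: std={erratic['std']:.3f} stable={erratic['stable']}")
```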

License

Apache License 2.0. See LICENSE.

Contributing

Contributions welcome! See CONTRIBUTING.md.

Release history

Version	Released
0.5.0	Jul 3, 2026
0.4.0	Jul 3, 2026
0.3.0 (this version)	Jun 30, 2026
0.2.0	Jun 27, 2026
0.1.0	Aug 1, 2025

Download files

Source Distribution

openjury-0.3.0.tar.gz (52.3 kB view details)

Uploaded Jun 30, 2026 Source

Built Distribution

openjury-0.3.0-py3-none-any.whl (43.7 kB view details)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file openjury-0.3.0.tar.gz.

File metadata

Download URL: openjury-0.3.0.tar.gz
Upload date: Jun 30, 2026
Size: 52.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for openjury-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`fed946812216dec103fb21579db6e5a4ac122f95bc8e7a6aeaf70bc735b60a0a`
MD5	`5a03e7004c85c2ef544a484dd890d75d`
BLAKE2b-256	`5c4c9aae647051d2bfbad0cdefa7eea13452bce8aaff975a97429756040c1f40`

File details

Details for the file openjury-0.3.0-py3-none-any.whl.

File metadata

Download URL: openjury-0.3.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 43.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for openjury-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`335a77be30d18026df7ad7dbc1a9e84b652913a6bbe41e50f3d8f2a52a27aec7`
MD5	`53450c89f7432a669576ece5e4754234`
BLAKE2b-256	`b9625de75482904faf99a332ca1c0f066dae7e5b5f72b1ffdf0194a16cb10bc5`
