Python SDK for evaluating multiple model outputs using configurable LLM-based jurors
OpenJury 🏛️
A Python SDK for evaluating your agent's response quality using a configurable panel of LLM judges.
Overview
OpenJury is an agent evaluation framework. Point it at your agent's HTTP endpoint and it will:
- Send a prompt to your agent and collect the response
- Pass the response to a panel of LLM judges (jurors), each scoring it against your criteria
- Return a composite quality score with a full statistical breakdown
The primary output is a single composite_score: a weighted mean of all juror scores across all criteria. It is accompanied by eight additional canned metrics (median, harmonic mean, weakest link, juror agreement, and more). You can also register a custom scoring function for domain-specific logic.
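To make the aggregation concrete, here is a minimal sketch of how a weighted mean over jurors and criteria could be computed, next to a median, a harmonic mean, and a weakest-link value. The weighting scheme (juror weight times criterion weight), the reading of "weakest link" as the minimum score, and all names below are assumptions for illustration, not OpenJury's confirmed implementation.

```python
from statistics import harmonic_mean, median

# Hypothetical scores[juror][criterion] on a 1-5 scale.
scores = {
    "Support Expert": {"helpfulness": 4, "accuracy": 5},
    "Customer Perspective": {"helpfulness": 3, "accuracy": 4},
}
juror_weights = {"Support Expert": 2.0, "Customer Perspective": 1.0}
criterion_weights = {"helpfulness": 2.0, "accuracy": 2.0}


def weighted_mean(scores, juror_weights, criterion_weights):
    """Assumed scheme: weight each score by juror weight * criterion weight."""
    total = 0.0
    weight_sum = 0.0
    for juror, per_criterion in scores.items():
        for criterion, score in per_criterion.items():
            w = juror_weights[juror] * criterion_weights[criterion]
            total += w * score
            weight_sum += w
    return total / weight_sum


# Flatten all individual scores for the unweighted summary statistics.
flat = [s for per_criterion in scores.values() for s in per_criterion.values()]

composite = weighted_mean(scores, juror_weights, criterion_weights)
summary = {
    "weighted_mean": composite,
    "median": median(flat),
    "harmonic_mean": harmonic_mean(flat),
    "weakest_link": min(flat),  # assumption: lowest single score
}
print(summary)
```

The harmonic mean sits below the arithmetic mean whenever scores differ, which is why it is useful for penalizing a single low score more strongly than a plain average does.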
Why a jury instead of a single judge?
Relying on one LLM to evaluate outputs is common but fragile: it's expensive and prone to intra-model bias. Research from Cohere shows that a panel of smaller, diverse models produces more reliable and less biased evaluations at lower cost.
OpenJury makes this practical: configure jurors declaratively in JSON, wire rubrics per criterion for consistent scoring, and get a structured result you can act on.
Installation
Requirements: Python 3.11 or newer
pip install openjury
From source
git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]" # optional dev dependencies
Choose your path
| Track | Goal | Time | API keys? |
|---|---|---|---|
| Try it | See output shape, understand AgentEvalResult | 2 min | No |
| Evaluate my agent | Full fetch + score pipeline | 10 min | Yes |
| Production integrate | Batch, resume, custom scoring, CI | 30+ min | Yes |
Try it (no agent, no keys)
pip install openjury
python examples/hello_world/score_existing.py
→ examples/hello_world/ · offline demo with sample output
Evaluate my agent
# Terminal 1 — mock agent (or use your own endpoint)
python examples/tools/mock_agent.py --port 8080
# Terminal 2
export OPENAI_API_KEY="..." AGENT_API_KEY=demo
python examples/basic_usage/basic_jury_run.py
→ examples/basic_usage/ · docs/endpoint-config.md
Go deeper
- docs/ — architecture, config schema, composable API, CLI
- recipes/ — task-oriented how-tos
- notebooks/ — interactive Jupyter walkthroughs
- examples/ — full examples index
Quick Start
1. Create a jury config
Set a jury-level llm_provider for shared credentials. Jurors inherit it by default. Use ${ENV_VAR} for secrets.
{
  "name": "Customer Support Jury",
  "score_scale": 5,
  "llm_provider": {
    "provider": "openai_compatible",
    "model_name": "gpt-4o-mini",
    "api_key": "${OPENAI_API_KEY}"
  },
  "jurors": [
    { "name": "Support Expert", "system_prompt": "You are a senior support manager.", "weight": 2.0 },
    { "name": "Customer Perspective", "weight": 1.0 }
  ],
  "criteria": [
    {
      "name": "helpfulness",
      "description": "Does the response resolve the customer's issue?",
      "weight": 2.0,
      "rubric": {
        "1": "Ignores or misunderstands the question",
        "3": "Partially addresses the question",
        "5": "Directly and completely resolves the issue"
      }
    },
    {
      "name": "accuracy",
      "description": "Is the information factually correct?",
      "weight": 2.0,
      "rubric": {
        "1": "Contains factual errors",
        "3": "Mostly accurate with minor gaps",
        "5": "Completely accurate"
      }
    }
  ]
}
Full field reference: docs/config-schema.md
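The `${ENV_VAR}` placeholders in the config are resolved from the environment. As a rough illustration of the idea (not OpenJury's actual loader), the sketch below expands such placeholders in a parsed JSON structure and fails loudly when a variable is unset. The helper name `expand_env` and the choice of `KeyError` are hypothetical.

```python
import json
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=os.environ):
    """Recursively replace ${VAR} inside strings; raise KeyError if VAR is unset."""
    if isinstance(value, str):
        return _PLACEHOLDER.sub(lambda m: env[m.group(1)], value)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value  # numbers, booleans, None pass through unchanged


raw = json.loads(
    '{"llm_provider": {"api_key": "${OPENAI_API_KEY}"}, "score_scale": 5}'
)
# An explicit mapping stands in for the real environment in this example.
resolved = expand_env(raw, env={"OPENAI_API_KEY": "sk-example"})
print(resolved["llm_provider"]["api_key"])
```

Expanding after parsing, rather than on the raw JSON text, means a secret containing quotes or backslashes cannot corrupt the document structure.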
2. Run an evaluation
from openjury import JuryConfig, OpenJury, ResultFormatter
from openjury.endpoint_fetcher import AgentEndpoint

jury = OpenJury(JuryConfig.from_json_file("jury_config.json"))
endpoint = AgentEndpoint(
    url="http://localhost:8080/v1/chat/completions",
    alias="my-agent",
    headers={"Authorization": "Bearer ${AGENT_API_KEY}"},
    request_body_template={
        "model": "my-model",
        "messages": [{"role": "user", "content": "{prompt}"}],
    },
)
result = jury.evaluate(prompt="How do I reset my password?", endpoint=endpoint)
print(ResultFormatter.format_result(result))
print(f"Score: {result.composite_score:.2f} / {result.score_scale}")
score_response() is a backward-compatible alias for evaluate().
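The `request_body_template` above carries a `{prompt}` placeholder. A plausible way to fill such a template, sketched here under the assumption that substitution happens on the Python structure before serialization, is shown below. `render_template` is a hypothetical helper, not part of the OpenJury API.

```python
import json


def render_template(template, prompt):
    """Return a copy of `template` with every "{prompt}" occurrence filled in."""
    if isinstance(template, str):
        # str.replace avoids str.format, which would trip over other braces.
        return template.replace("{prompt}", prompt)
    if isinstance(template, dict):
        return {k: render_template(v, prompt) for k, v in template.items()}
    if isinstance(template, list):
        return [render_template(v, prompt) for v in template]
    return template


request_body_template = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "{prompt}"}],
}
body = render_template(request_body_template, 'How do I reset my "admin" password?')
payload = json.dumps(body)  # quotes in the prompt are escaped by json.dumps
print(payload)
```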
CLI:
openjury run \
--config jury_config.json \
--endpoints-config endpoints.json \
--prompt "How do I reset my password?"
3. Read the output
╔══ Quality Evaluation (scale: 1–5) ══
composite_score: 3.87 / 5 (0.774 normalized)
juror_agreement (0–1) 0.880 ← 1 = unanimous
...
- composite_score — headline quality number (weighted_mean from trial 1)
- juror_agreement — near 1.0 = high confidence; near 0 = contested
- weakest_link — flags a standout failure even when composite looks okay
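One way to act on these three numbers is a small triage step, sketched below. The normalized value follows the sample output (3.87 / 5 = 0.774); the thresholds, the function name `triage`, and the use of plain arguments instead of result-object attributes are all assumptions for illustration.

```python
def triage(composite_score, score_scale, juror_agreement, weakest_link,
           min_normalized=0.7, min_agreement=0.6, min_weakest=2.0):
    """Classify an evaluation using hypothetical thresholds."""
    normalized = composite_score / score_scale
    if weakest_link < min_weakest:
        # A single very low score is surfaced even if the average looks fine.
        return "fail: standout weak score"
    if juror_agreement < min_agreement:
        # Contested results are routed to a human rather than trusted.
        return "review: jurors disagree"
    if normalized < min_normalized:
        return "fail: low composite"
    return "pass"


print(triage(3.87, 5, juror_agreement=0.88, weakest_link=3.0))
```

Checking the weakest link first reflects the note above: a standout failure should not be hidden by an acceptable composite.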
Key Features
- Agent Evaluation — score a single agent response per prompt
- Structured Rubrics — score anchors per criterion improve inter-juror reliability
- Eight Canned Metrics — weighted mean, median, harmonic mean, weakest link, juror agreement, and more
- Custom Scoring — register a Python function for domain-specific composite logic
- Consistency Audit — num_trials > 1 measures response reliability
- Batch Evaluation — inline config datasets, JSONL/CSV files, or evaluate_items()
- Parallel Processing — concurrent jurors and batch items
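As an illustration of the custom scoring feature listed above, the following sketch shows what a safety-gate composite could look like as a plain Python function. The input shape, the criterion name `safety`, and the threshold are hypothetical, and the function is deliberately not wired into `ScoreAggregator.register()`, whose signature is documented in examples/custom_scoring/ rather than here.

```python
from statistics import mean


def safety_gated_score(criterion_scores, gate_criterion="safety", gate_threshold=3.0):
    """Average per-criterion means, but cap the result when the gate criterion fails.

    criterion_scores: {criterion_name: [juror scores]} (assumed input shape).
    """
    per_criterion = {name: mean(vals) for name, vals in criterion_scores.items()}
    overall = mean(list(per_criterion.values()))
    gate = per_criterion.get(gate_criterion)
    if gate is not None and gate < gate_threshold:
        # A failing safety score overrides an otherwise good composite.
        return min(overall, gate)
    return overall


safe = {"helpfulness": [5, 5], "safety": [4, 4]}
unsafe = {"helpfulness": [5, 5], "safety": [1, 2]}
print(safety_gated_score(safe), safety_gated_score(unsafe))
```

The gate keeps a highly helpful but unsafe response from receiving a passable composite through averaging alone.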
Examples
| Example | What it shows |
|---|---|
| examples/hello_world/ | Offline demo — no agent, no API keys |
| examples/basic_usage/ | Single prompt, full pipeline, reading AgentEvalResult |
| examples/provider_configs/ | OpenAI, OpenRouter, Ollama, mixed providers |
| examples/batch_eval/ | Inline/JSONL/CSV datasets, batch-eval CLI |
| examples/custom_scoring/ | ScoreAggregator.register(), safety-gate pattern |
| examples/consistency_audit/ | num_trials > 1, ConsistencyResult.score_std |
| examples/resume_evaluation/ | Fetch/score split, crash recovery |
| examples/web_server/ | Flask API wrapping evaluation |
| examples/tools/ | Mock agent for local development |
Full index: examples/README.md
Troubleshooting
| Symptom | Fix |
|---|---|
| ConfigurationError for ${VAR} | Export env vars before OpenJury(...). See provider-config.md |
| Partial juror override ValidationError | Set all three: model_name, api_key, provider. See config-schema.md |
| JurorException: missing criterion | Juror JSON keys must match criteria[].name exactly |
| EndpointFetchError | Check URL, headers, response_path. See endpoint-config.md |
| Low juror_agreement | Add rubrics, lower juror temperature. See recipes/design-rubrics.md |
Documentation
| Resource | Description |
|---|---|
| docs/ | Architecture, config schema, API, CLI |
| recipes/ | Task-oriented cookbook |
| notebooks/ | Interactive tutorials |
| CONTRIBUTING.md | Development setup |
Advanced topics (moved from this README for brevity):
- Composable API — fetch/score split, batch, serialization
- Batch evaluation — inline and JSONL/CSV datasets
- Consistency audit — num_trials
- Custom scoring — safety gates
- Provider setup — OpenAI, OpenRouter, Anthropic, Ollama
Use Cases
- Customer support agents — score helpfulness, accuracy, and tone per response
- Code review assistants — evaluate correctness, readability, and security
- Content generation — assess clarity, tone, and factuality before publishing
- Production monitoring — track composite_score drift between model versions
- Consistency testing — run num_trials=3 before shipping a prompt change
License
Apache License 2.0. See LICENSE.
Contributing
Contributions welcome! See CONTRIBUTING.md.