OpenJury 🏛️

A Python SDK for evaluating and comparing multiple model outputs using configurable LLM-based jurors.

Python 3.11+ | License: Apache 2.0


Overview

OpenJury is a post-inference ensemble framework that evaluates and compares multiple model outputs using configurable LLM-based jurors. It brings structured model assessment, ranking, and A/B testing directly into your Python apps, research workflows, or ML platforms.

At its core, OpenJury is a decision-level, LLM-driven evaluation system that aggregates juror scores using flexible voting strategies (e.g., weighted, ranked, or consensus). This makes it an extensible solution for nuanced, after-inference comparison of generated outputs across models, prompts, versions, or datasets.
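To make the aggregation step concrete, here is a minimal sketch of weighted voting in plain Python. It is an illustration only and does not use or reproduce OpenJury's internals; the function name `weighted_winner`, the juror names, and the shape of the input dictionaries are assumptions made for this example.

```python
def weighted_winner(juror_scores, juror_weights):
    """Pick the response with the highest weighted mean score.

    juror_scores maps juror name -> list of scores, one per response.
    juror_weights maps juror name -> non-negative weight.
    Both shapes are hypothetical, chosen for this sketch.
    """
    total_weight = sum(juror_weights[name] for name in juror_scores)
    if total_weight <= 0:
        raise ValueError("juror weights must sum to a positive number")

    n_responses = len(next(iter(juror_scores.values())))
    means = []
    for i in range(n_responses):
        weighted_sum = sum(
            juror_weights[name] * scores[i]
            for name, scores in juror_scores.items()
        )
        means.append(weighted_sum / total_weight)

    # max() returns the first index when two responses tie.
    winner = max(range(n_responses), key=lambda i: means[i])
    return winner, means


# Made-up scores for two responses from two jurors.
scores = {"senior_dev": [5.0, 2.0], "reviewer": [1.0, 5.0]}
print(weighted_winner(scores, {"senior_dev": 2.0, "reviewer": 1.0})[0])  # 0
print(weighted_winner(scores, {"senior_dev": 1.0, "reviewer": 1.0})[0])  # 1
```

With equal weights the second response wins; doubling one juror's weight flips the outcome, which is the effect juror weights are meant to have.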

Why use an LLM Jury?

AI models can generate fluent, convincing outputs, but fluency != correctness. Whether you're building a customer service agent, a code review assistant, or a content generator, you need to know which response is best, whether it is correct, and how models compare on quality and consistency. Human evaluation doesn't scale, which is why LLM-based jurors are widely used.

But relying on a single LLM (like GPT-4o) to evaluate model outputs, although common, is expensive and can introduce intra-model bias. Research by Cohere shows that using a panel of smaller, diverse models not only cuts cost but also leads to more reliable and less biased evaluations.

OpenJury puts this into practice: instead of a single judge, it uses multiple jurors to score and explain outputs. The result? Better evaluations and lower costs, all configurable through a declarative interface.


Key Features

  • Python SDK: Simple integration, flexible configuration
  • Multi-Criteria Evaluation: Define custom criteria with weights and scoring
  • Advanced Voting Methods: Majority, average, weighted, ranked, consensus, or your own
  • Parallel Processing: Evaluate at scale, concurrently
  • Rich Output: Scores, explanations, voting breakdowns, and confidence metrics
  • Extensible: Plug in your own jurors, voting logic, and evaluation strategies
  • Dev Experience: One-command setup, Makefile workflow, and modern code quality tools
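As a rough illustration of the simplest voting method listed above, the sketch below implements a majority vote and reports the share of jurors who agreed with the winner. Treating that share as a "confidence" value is an assumption made for this example, not necessarily how OpenJury computes its confidence metric; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(picks):
    """Return (winning response index, fraction of jurors who chose it).

    picks is a list with one entry per juror: the index of the response
    that juror preferred. On a tie, the pick that appeared first wins.
    """
    if not picks:
        raise ValueError("at least one juror pick is required")
    winner, votes = Counter(picks).most_common(1)[0]
    return winner, votes / len(picks)


# Three jurors: two prefer response 0, one prefers response 1.
winner, agreement = majority_vote([0, 1, 0])
print(winner)               # 0
print(f"{agreement:.2%}")   # 66.67%
```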

Installation

Requirements: Python 3.11 or newer

Recommended (PyPI)

pip install openjury

From Source (for development/contribution)

git clone https://github.com/robiscoding/openjury.git
cd openjury
pip install -e .
uv pip install -e ".[dev]"     # (optional) dev dependencies

Quick Start

Set Environment Variables

export OPENROUTER_API_KEY="your-api-key"

Or, if you're using OpenAI:

export LLM_PROVIDER="openai"
export OPENAI_API_KEY="your-api-key"
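Before running an evaluation, it can help to fail fast when the expected key is missing. The following sketch checks the variables shown above. It assumes that OpenRouter is the provider when `LLM_PROVIDER` is unset, which is inferred from the ordering of the instructions rather than confirmed; `resolve_provider` is a hypothetical helper, not part of the SDK.

```python
import os


def resolve_provider(env=None):
    """Return (provider, key_variable) or raise if the API key is missing.

    Assumption: the provider defaults to "openrouter" when LLM_PROVIDER
    is not set.
    """
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openrouter").lower()
    key_var = "OPENAI_API_KEY" if provider == "openai" else "OPENROUTER_API_KEY"
    if not env.get(key_var):
        raise RuntimeError(f"{key_var} is not set for provider '{provider}'")
    return provider, key_var


# Demonstration with an explicit mapping instead of the real environment.
print(resolve_provider({"LLM_PROVIDER": "openai", "OPENAI_API_KEY": "your-api-key"}))
```

Calling `resolve_provider()` with no argument checks the real process environment.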

Basic Usage

from openjury import OpenJury, JuryConfig

config = JuryConfig.from_json_file("jury_config.json")
jury = OpenJury(config)
verdict = jury.evaluate(
    prompt="Write a Python function to reverse a string",
    responses=[
        "def reverse(s): return s[::-1]",
        "def reverse(s): return ''.join(reversed(s))"
    ]
)

print(f"Winner: {verdict.final_verdict.winner}")
print(f"Confidence: {verdict.final_verdict.confidence:.2%}")

Configuration Example (jury_config.json)

{
  "name": "Code Quality Jury",
  "criteria": [
    {"name": "correctness", "weight": 2.0, "max_score": 5},
    {"name": "readability", "weight": 1.5, "max_score": 5}
  ],
  "jurors": [
    {"name": "Senior Developer", "system_prompt": "You are a senior developer. You are tasked with reviewing the code and providing a score and explanation for the correctness and readability of the code.", "model_name": "qwen/qwen-2.5-coder-32b", "weight": 2.0},
    {"name": "Code Reviewer", "system_prompt": "You are a code reviewer. You are tasked with reviewing the code and providing a score and explanation for the correctness and readability of the code.", "model_name": "llama3/llama-3.1-8b-instruct", "weight": 1.0}
  ],
  "voting_method": "weighted"
}
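To show how the `weight` and `max_score` fields of the criteria could interact, here is a small sketch that normalizes each criterion score to the 0..1 range and takes a weighted mean. This is one plausible reading of the configuration, written for illustration; the function `combined_score` is hypothetical and OpenJury's actual scoring formula may differ.

```python
import json

# The criteria section from the configuration example above.
CONFIG_JSON = """
{
  "criteria": [
    {"name": "correctness", "weight": 2.0, "max_score": 5},
    {"name": "readability", "weight": 1.5, "max_score": 5}
  ]
}
"""


def combined_score(criteria, raw_scores):
    """Weighted mean of per-criterion scores, each scaled to 0..1."""
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive number")
    weighted = sum(
        c["weight"] * raw_scores[c["name"]] / c["max_score"] for c in criteria
    )
    return weighted / total_weight


criteria = json.loads(CONFIG_JSON)["criteria"]
# Made-up juror scores: full marks for correctness, 3 of 5 for readability.
score = combined_score(criteria, {"correctness": 5, "readability": 3})
print(f"{score:.3f}")  # 0.829
```

Because correctness carries the larger weight, a perfect correctness score pulls the combined value above the unweighted average of 0.8.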

Examples

You can find more examples in the examples directory.

Use Cases

Model Evaluation & Comparison

  • Compare outputs from different models (e.g., GPT-4 vs Claude vs custom models)
  • Run A/B tests across prompt variations, fine-tuned models, or versions

Content & Response Quality

  • Evaluate generated code for correctness and readability
  • Score long-form content (blogs, papers, explanations) for clarity, tone, or coherence

Automated Grading & Assessment

  • Grade student answers or interview responses at scale
  • Score generated outputs against rubric-style criteria

Production Monitoring & QA

  • Monitor output quality in production systems
  • Detect degradation or drift between model versions
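A simple way to flag degradation between versions is to compare mean jury scores against a tolerance. The sketch below assumes you have already collected one normalized score per evaluated prompt for each version; the helper `has_regressed`, the tolerance value, and the sample numbers are all made up for illustration.

```python
from statistics import mean


def has_regressed(baseline_scores, candidate_scores, tolerance=0.05):
    """Return True if the candidate's mean score falls more than
    `tolerance` below the baseline's mean score."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(baseline_scores) - mean(candidate_scores) > tolerance


baseline = [0.82, 0.79, 0.85]
print(has_regressed(baseline, [0.70, 0.74, 0.69]))  # True
print(has_regressed(baseline, [0.81, 0.80, 0.83]))  # False
```

A mean comparison like this ignores variance, so with small samples a statistical test would be a more careful choice.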

Custom Evaluation Workflows

  • Integrate LLM-based judgment into human-in-the-loop pipelines
  • Use configurable jurors and voting for domain-specific tasks

License

OpenJury is licensed under the Apache License 2.0. See the LICENSE file for details.


Contributing

Contributions are welcome! Please see the CONTRIBUTING.md file for details.
