
OpenJudge: A Unified Framework for Holistic Evaluation and Quality Reward


OpenJudge Logo

Holistic Evaluation, Quality Rewards: Driving Application Excellence

🌟 If you find OpenJudge helpful, please give us a Star! 🌟

Python 3.10+ PyPI Documentation Website Try Online

๐ŸŒ Website | ๐Ÿš€ Try Online | ๐Ÿ“– Documentation | ๐Ÿค Contributing | ไธญๆ–‡

OpenJudge is an open-source evaluation framework for AI applications (e.g., AI agents or chatbots) designed to evaluate quality and drive continuous application optimization.

In practice, application excellence depends on a trustworthy evaluation workflow: Collect test data → Define graders → Run evaluation at scale → Analyze weaknesses → Iterate quickly.

OpenJudge provides ready-to-use graders and supports generating scenario-specific rubrics (as graders), making this workflow simpler, more rigorous, and easier to integrate into your existing stack. It can also convert grading results into reward signals to help you fine-tune and optimize your application.

🚀 Try it now! Visit openjudge.me/app to use graders online with no installation required. Test built-in graders, build custom rubrics, and explore evaluation results directly in your browser.




News

  • 2026-04-07 - 🔒 Skill Graders - 5 new LLM-based graders for evaluating AI Agent Skill packages: threat analysis (AITech taxonomy), declaration alignment, completeness, relevance, and design quality. 👉 Documentation | Cookbook

  • 2026-03-10 - 🛠️ New Skills - Claude authenticity verification, find skills combo, and more. 👉 Browse Skills

  • 2026-02-12 - 📚 Reference Hallucination Arena - Benchmark for evaluating LLM academic reference hallucination. 👉 Documentation | 📊 Leaderboard

  • 2026-01-27 - 🆕 Paper Review - Automatically review academic papers using LLM-powered evaluation. 👉 Documentation

  • 2026-01-27 - 🖥️ OpenJudge UI - A Streamlit-based visual interface for grader testing and Auto Arena. 👉 Try Online | Run locally: streamlit run ui/app.py


✨ Key Features

📦 Systematic & Quality-Assured Grader Library

Access 50+ production-ready graders, organized in a comprehensive taxonomy and rigorously validated for reliable performance.

🎯 General

Focus: Semantic quality, functional correctness, structural compliance

Key Graders:

  • Relevance - Semantic relevance scoring
  • Similarity - Text similarity measurement
  • Syntax Check - Code syntax validation
  • JSON Match - Structure compliance

🤖 Agent

Focus: Agent lifecycle, tool calling, memory, plan feasibility, trajectory quality

Key Graders:

  • Tool Selection - Tool choice accuracy
  • Memory - Context preservation
  • Plan - Strategy feasibility
  • Trajectory - Path optimization

๐Ÿ–ผ๏ธ Multimodal

Focus: Image-text coherence, visual generation quality, image helpfulness

Key Graders:

  • Image Coherence - Visual-text alignment
  • Text-to-Image - Generation quality
  • Image Helpfulness - Image contribution

  • 🌐 Multi-Scenario Coverage: Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks. 👉 Explore Supported Scenarios
  • 🔄 Holistic Agent Evaluation: Beyond final outcomes, we assess the entire lifecycle, including trajectories, memory, reflection, and tool use. 👉 Agent Lifecycle Evaluation
  • ✅ Quality Assurance: Every grader comes with benchmark datasets and pytest integration for validation. 👉 View Benchmark Datasets

๐Ÿ› ๏ธ Flexible Grader Building Methods

Choose the build method that fits your requirements:

  • Customization: Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader. 👉 Custom Grader Development Guide
  • Zero-shot Rubrics Generation: Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries, and the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. 👉 Zero-shot Rubrics Generation Guide
  • Data-driven Rubrics Generation: Ambiguous requirements, but a few labeled examples in hand? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. 👉 Data-driven Rubrics Generation Guide
  • Training Judge Models: Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. 👉 Train Judge Models
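For the first path, when the rules are explicit, a grader can be little more than a scoring function. The sketch below is a standalone illustration of the idea and deliberately avoids OpenJudge's actual base classes (whose interfaces may differ): a toy rule-based grader that scores a response by the fraction of required keywords it contains.

```python
from dataclasses import dataclass


@dataclass
class GradeResult:
    # Minimal stand-in for a grader result: a numeric score plus an explanation.
    score: float
    reason: str


class KeywordCoverageGrader:
    """Toy rule-based grader: score = fraction of required keywords present."""

    def __init__(self, required_keywords: list[str]):
        self.required_keywords = [k.lower() for k in required_keywords]

    def evaluate(self, response: str) -> GradeResult:
        text = response.lower()
        hits = [k for k in self.required_keywords if k in text]
        score = len(hits) / len(self.required_keywords)
        return GradeResult(
            score=score,
            reason=f"Matched {len(hits)}/{len(self.required_keywords)} required keywords",
        )


grader = KeywordCoverageGrader(["order", "delivery"])
print(grader.evaluate("Your order is out for delivery.").score)  # 1.0
```

The same rules-as-code shape scales to regex checks, JSON-schema validation, or length limits; only genuinely fuzzy criteria need an LLM-based grader.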

🔌 Easy Integration

Using mainstream observability platforms like LangSmith or Langfuse? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We also provide integrations with training frameworks like VERL for RL training. 👉 See Integrations for details

๐ŸŒ Online Playground

Explore OpenJudge without writing a single line of code. Our online platform at openjudge.me/app lets you:

  • Test graders interactively - select a built-in grader, input your data, and see results instantly
  • Build custom rubrics - use the zero-shot generator to create graders from task descriptions
  • View leaderboards - compare model performance across evaluation benchmarks at openjudge.me/leaderboard

📥 Installation

💡 Don't want to install anything? Try OpenJudge online and use graders directly in your browser, no setup needed.

pip install py-openjudge

💡 More installation methods can be found in the Quickstart Guide.


🚀 Quickstart

📚 The complete quickstart can be found in the Quickstart Guide.

Simple Example

A simple example to evaluate a single response:

import asyncio
from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

async def main():
    # 1️⃣ Create model client
    model = OpenAIChatModel(model="qwen3-32b")
    # 2️⃣ Initialize grader
    grader = RelevanceGrader(model=model)
    # 3️⃣ Prepare data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of AI that enables computers to learn from data.",
    }
    # 4️⃣ Evaluate
    result = await grader.aevaluate(**data)
    print(f"Score: {result.score}")   # Score: 4
    print(f"Reason: {result.reason}")

if __name__ == "__main__":
    asyncio.run(main())

Evaluate LLM Applications with Built-in Graders

Use multiple built-in graders to comprehensively evaluate your LLM application: 👉 Explore all built-in graders

Business Scenario: Evaluating an e-commerce customer service agent that handles order inquiries. We assess the agent's performance across three dimensions: relevance, hallucination, and tool selection.

import asyncio
from openjudge.models import OpenAIChatModel
from openjudge.graders.common import RelevanceGrader, HallucinationGrader
from openjudge.graders.agent.tool.tool_selection import ToolSelectionGrader
from openjudge.runner import GradingRunner
from openjudge.runner.aggregator import WeightedSumAggregator
from openjudge.analyzer.statistical import DistributionAnalyzer

TOOL_DEFINITIONS = [
    {"name": "query_order", "description": "Query order status and logistics information", "parameters": {"order_id": "str"}},
    {"name": "query_logistics", "description": "Query detailed logistics tracking", "parameters": {"order_id": "str"}},
    {"name": "estimate_delivery", "description": "Estimate delivery time", "parameters": {"order_id": "str"}},
]
# Prepare your dataset
dataset = [{
    "query": "Where is my order ORD123456?",
    "response": "Your order ORD123456 has arrived at the Beijing distribution center and is expected to arrive tomorrow.",
    "context": "Order ORD123456: Arrived at Beijing distribution center, expected to arrive tomorrow.",
    "tool_definitions": TOOL_DEFINITIONS,
    "tool_calls": [{"name": "query_order", "arguments": {"order_id": "ORD123456"}}],
    # ... more test cases
}]
async def main():
    # 1️⃣ Initialize judge model
    model = OpenAIChatModel(model="qwen3-max")
    # 2️⃣ Configure multiple graders
    grader_configs = {
        "relevance": {"grader": RelevanceGrader(model=model), "mapper": {"query": "query", "response": "response"}},
        "hallucination": {"grader": HallucinationGrader(model=model), "mapper": {"query": "query", "response": "response", "context": "context"}},
        "tool_selection": {"grader": ToolSelectionGrader(model=model), "mapper": {"query": "query", "tool_definitions": "tool_definitions", "tool_calls": "tool_calls"}},
    }
    # 3️⃣ Set up aggregator for overall score
    aggregator = WeightedSumAggregator(name="overall_score", weights={"relevance": 0.3, "hallucination": 0.4, "tool_selection": 0.3})
    # 4️⃣ Run evaluation
    results = await GradingRunner(grader_configs=grader_configs, aggregators=[aggregator], max_concurrency=5).arun(dataset)
    # 5️⃣ Generate evaluation report
    overall_stats = DistributionAnalyzer().analyze(dataset, results["overall_score"])
    print(f"{'Overall Score':<20} | {overall_stats.mean:>15.2f}")

if __name__ == "__main__":
    asyncio.run(main())
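To make the aggregation step concrete: the WeightedSumAggregator combines each sample's per-grader scores into one overall score. The arithmetic, shown here in plain Python rather than the OpenJudge class, is just a weighted sum; the per-grader scores below are hypothetical.

```python
def weighted_sum(scores: dict[str, float], weights: dict[str, float]) -> float:
    # With weights that sum to 1.0, the result stays on the original score scale.
    return sum(scores[name] * weights[name] for name in weights)


# Hypothetical per-grader scores for one sample, using the weights above.
scores = {"relevance": 4.0, "hallucination": 5.0, "tool_selection": 3.0}
weights = {"relevance": 0.3, "hallucination": 0.4, "tool_selection": 0.3}
print(round(weighted_sum(scores, weights), 2))  # 4.1
```

Giving hallucination the largest weight (0.4) means factual grounding dominates the overall score, which is a common choice for customer-facing agents.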

Build Custom Graders for Your Scenario

Zero-shot Rubric Generation

Generate a custom grader from a task description without labeled data: 👉 Zero-shot Rubrics Generation Guide

When to use: Quick prototyping when you have no labeled data but can clearly describe your task.

import asyncio
from openjudge.generator.simple_rubric import SimpleRubricsGenerator, SimpleRubricsGeneratorConfig
from openjudge.models import OpenAIChatModel

async def main():
    # 1️⃣ Configure generator
    config = SimpleRubricsGeneratorConfig(
        grader_name="customer_service_grader",
        model=OpenAIChatModel(model="qwen3-max"),
        task_description="E-commerce AI customer service primarily handles order inquiry tasks (such as logistics status and ETA) while focusing on managing customer emotions.",
        min_score=1,
        max_score=3,
    )
    # 2️⃣ Generate grader
    generator = SimpleRubricsGenerator(config)
    grader = await generator.generate(dataset=[], sample_queries=[])
    # 3️⃣ View generated rubrics
    print("Generated Rubrics:", grader.kwargs.get("rubrics"))
    # 4️⃣ Use the grader
    result = await grader.aevaluate(
        query="My order is delayed, what should I do?",
        response="I understand your concern. Let me check your order status..."
    )
    print(f"\nScore: {result.score}/3\nReason: {result.reason}")

if __name__ == "__main__":
    asyncio.run(main())

Data-driven Rubric Generation

Learn evaluation criteria from labeled examples: 👉 Data-driven Rubrics Generation Guide

When to use: You have labeled data and need high-accuracy graders for production use, especially when evaluation criteria are implicit.

import asyncio
from openjudge.generator.iterative_rubric.generator import IterativeRubricsGenerator, IterativePointwiseRubricsGeneratorConfig
from openjudge.models import OpenAIChatModel
from openjudge.models.schema.prompt_template import LanguageEnum

# Prepare a labeled dataset (simplified example; we recommend 10+ samples in practice)
labeled_dataset = [
    {"query": "My order hasn't arrived after 10 days, I want to complain!", "response": "I sincerely apologize for the delay. I completely understand your frustration! Your order was delayed due to weather conditions, but it has now resumed shipping and is expected to arrive tomorrow. I've marked it for priority delivery.", "label_score": 5},
    {"query": "Where is my package? I need it urgently!", "response": "I understand your urgency! Your package is currently out for delivery and is expected to arrive before 2 PM today. The delivery driver's contact number is 138xxxx.", "label_score": 5},
    {"query": "Why hasn't my order arrived yet? I've been waiting for days!", "response": "Your order is expected to arrive the day after tomorrow.", "label_score": 2},
    {"query": "The logistics hasn't updated in 3 days, is it lost?", "response": "Hello, your package is not lost. It's still in transit, please wait patiently.", "label_score": 3},
    # ... more labeled examples
]

async def main():
    # 1️⃣ Configure generator
    config = IterativePointwiseRubricsGeneratorConfig(
        grader_name="customer_service_grader_v2", model=OpenAIChatModel(model="qwen3-max"),
        min_score=1, max_score=5,
        enable_categorization=True, categories_number=5,  # Aggregate learned rubrics into 5 themes
    )
    # 2️⃣ Generate grader from labeled data
    generator = IterativeRubricsGenerator(config)
    grader = await generator.generate(labeled_dataset)
    # 3️⃣ View learned rubrics
    print("\nLearned Rubrics from Labeled Data:\n", grader.kwargs.get("rubrics", "No rubrics generated"))
    # 4️⃣ Evaluate new samples
    test_cases = [
        {"query": "My order hasn't moved in 5 days, can you check? I'm a bit worried", "response": "I understand your concern! Let me check immediately: Your package is currently at XX distribution center. Due to recent high order volume, there's a slight delay, but it's expected to arrive the day after tomorrow. I'll proactively contact you if there are any issues."},
        {"query": "Why is this delivery so slow? I'm waiting to use it!", "response": "Checking, please wait."},
    ]
    print("\n" + "=" * 70, "\nEvaluation Results:\n", "=" * 70)
    for i, case in enumerate(test_cases):
        result = await grader.aevaluate(query=case["query"], response=case["response"])
        print(f"\n[Test {i+1}]\n  Query: {case['query']}\n  Response: {case['response']}\n  Score: {result.score}/5\n  Reason: {result.reason[:200]}...")

if __name__ == "__main__":
    asyncio.run(main())
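Before trusting a generated grader in production, it is worth checking its scores against held-out labels. One simple check, sketched here as plain Python rather than any OpenJudge API, is the mean absolute error between grader scores and the label_score column; the score lists below are hypothetical.

```python
def mean_absolute_error(predicted: list[float], labels: list[float]) -> float:
    # Average absolute gap between grader scores and human labels;
    # 0.0 means perfect agreement, and lower is better.
    assert len(predicted) == len(labels) and labels
    return sum(abs(p - l) for p, l in zip(predicted, labels)) / len(labels)


# Hypothetical grader scores vs. human labels on four held-out samples.
predicted = [5, 4, 2, 3]
labels = [5, 5, 2, 3]
print(mean_absolute_error(predicted, labels))  # 0.25
```

On a 1-5 scale, an MAE well under 1.0 suggests the learned rubrics track the annotators; a large MAE is a signal to add more labeled examples and regenerate.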

🔗 Integrations

Seamlessly connect OpenJudge with mainstream observability and training platforms:

Category      | Platform         | Status      | Documentation
--------------|------------------|-------------|--------------------------------
Observability | LangSmith        | ✅ Available | 👉 LangSmith Integration Guide
Observability | Langfuse         | ✅ Available | 👉 Langfuse Integration Guide
Observability | Other frameworks | 🔵 Planned   | —
Training      | verl             | ✅ Available | 👉 VERL Integration Guide
Training      | Trinity-RFT      | 🔵 Planned   | —

💬 Have a framework you'd like us to prioritize? Open an Issue!


๐Ÿค Contributing

We love your input! We want to make contributing to OpenJudge as easy and transparent as possible.

  • 🎨 Adding New Graders - Have domain-specific evaluation logic? Share it with the community!
  • 🐛 Reporting Bugs - Found a glitch? Help us fix it by opening an issue
  • 📝 Improving Docs - Clearer explanations or better examples are always welcome
  • 💡 Proposing Features - Have ideas for new integrations? Let's discuss!

📖 See the full Contributing Guidelines for coding standards and the PR process.


💬 Community

Join our DingTalk group to connect with the community:

DingTalk QR Code

Migration Guide (v0.1.x โ†’ v0.2.0)

OpenJudge was previously distributed as the legacy package rm-gallery (v0.1.x). Starting from v0.2.0, it is published as py-openjudge and the Python import namespace is openjudge.

OpenJudge v0.2.0 is NOT backward compatible with v0.1.x. If you are currently using v0.1.x, choose one of the following paths:

  • Stay on v0.1.x (legacy): keep using the old package
    pip install rm-gallery
  • Migrate to v0.2.0: install the new package and update your imports to the openjudge namespace
    pip install py-openjudge

We preserved the source code of v0.1.7 (the latest v0.1.x release) in the v0.1.7-legacy branch.

If you run into migration issues, please open an issue with your minimal repro and current version.


📄 Citation

If you use OpenJudge in your research, please cite:

@software{openjudge2025,
  title  = {OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards},
  author = {{The OpenJudge Team}},
  url    = {https://github.com/agentscope-ai/OpenJudge},
  month  = {07},
  year   = {2025}
}

Made with ❤️ by the OpenJudge Team

๐ŸŒ Website ยท ๐Ÿš€ Try Online ยท โญ Star Us ยท ๐Ÿ› Report Bug ยท ๐Ÿ’ก Request Feature



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py_openjudge-0.2.4.tar.gz (874.6 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

py_openjudge-0.2.4-py3-none-any.whl (1.3 MB)

Uploaded Python 3

File details

Details for the file py_openjudge-0.2.4.tar.gz.

File metadata

  • Download URL: py_openjudge-0.2.4.tar.gz
  • Upload date:
  • Size: 874.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.9

File hashes

Hashes for py_openjudge-0.2.4.tar.gz:

  • SHA256: 63a419fbfda55f15b634e717aef4a879808f26e9e2ceb818dd7487fc54c804cc
  • MD5: e80cb959de1662a4236e2327be5d7438
  • BLAKE2b-256: 79a1ddef1bbd502158971174ab3388977c318c9d434728b30d19aede66ad02be

See more details on using hashes here.

File details

Details for the file py_openjudge-0.2.4-py3-none-any.whl.

File metadata

File hashes

Hashes for py_openjudge-0.2.4-py3-none-any.whl:

  • SHA256: 1c4e1489c743ba6cdfeb9d6b68c2e3f82b914e26eafb7337699d32474621269e
  • MD5: bdc9e5399c1f9467159b868a548507bd
  • BLAKE2b-256: 40133d47c0ce773950b6d0c9c1b94ba47742987bafa0427c06cc8ac063145222

See more details on using hashes here.
