A Python package for aligning LLM judges to human preferences

JudgeSync 🧑‍⚖️

JudgeSync is a lightweight Python package for calibrating LLM judges to align with human evaluations. It helps minimize bias and improve reliability in LLM-as-a-judge workflows by comparing different judge configurations and finding the best alignment with human scores.

Why JudgeSync?

LLM judges are powerful but exhibit biases:

  • Verbosity bias: Favoring longer responses, even when they contain errors
  • Position bias: Preferring whichever option is presented first
  • Self-bias: Favoring outputs from models similar to the judge
  • Leniency/strictness bias: Inconsistent scoring severity across items

JudgeSync helps you find the optimal judge configuration (prompt, model, temperature) that best aligns with human judgments.

Installation

pip install judgesync

Quick Start

from judgesync import AlignmentTracker, ScoreRange

# Load your evaluation data with human scores
tracker = AlignmentTracker(score_range=ScoreRange.FIVE_POINT)
tracker.load_human_scores_from_csv("evaluation_data.csv")

# Ensure Azure OpenAI credentials are configured (see section below)
prompt_comparison = tracker.create_comparison()

prompt_comparison.add_judge(
    name="strict",
    system_prompt="You are a strict evaluator. Only give high scores to exceptional responses.",
)

prompt_comparison.add_judge(
    name="balanced",
    system_prompt="You are a balanced evaluator. Consider both strengths and weaknesses fairly.",
)

prompt_comparison.add_judge(
    name="detailed_rubric",
    system_prompt="""Rate responses on a 1-5 scale:
5: Comprehensive, accurate, well-structured
4: Good accuracy, minor gaps
3: Adequate, addresses main points
2: Partially correct, significant gaps
1: Incorrect or irrelevant""",
)

# Run comparison and find the best judge
results = prompt_comparison.run_comparison(tracker.data_loader.items, use_async=True)
print(results)

# Visualize results (requires matplotlib: pip install matplotlib)
prompt_comparison.plot_comparison(results, save_path="judge_comparison.png")

Key Metrics

Cohen's Kappa (κ)

Measures agreement between human and LLM judge, accounting for chance agreement:

  • κ > 0.7: Production-ready alignment
  • 0.4 ≤ κ ≤ 0.7: Moderate alignment; may need further prompt tuning
  • κ < 0.4: Poor alignment; needs improvement

Agreement Rate

Percentage of exact score matches between human and judge.
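Both metrics can be computed directly from paired score lists. A minimal sketch using scikit-learn (a listed requirement); the scores below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative 5-point scores (not real evaluation data)
human = np.array([5, 3, 4, 2, 5, 1])
judge = np.array([5, 3, 3, 2, 4, 1])

kappa = cohen_kappa_score(human, judge)    # chance-corrected agreement
agreement = float(np.mean(human == judge)) # exact-match rate

print(f"kappa={kappa:.2f}, agreement={agreement:.2%}")
```

Note that kappa can be substantially lower than the raw agreement rate when the score distribution is skewed, which is exactly why it is the headline metric here.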

Features

🎯 Score Alignment

  • Support for multiple scoring scales (binary, 5-point, 10-point, percentage)
  • Automatic score range validation
  • Statistical metrics (Cohen's Kappa, correlation, agreement rate)

🔬 Judge Comparison

  • Test multiple prompts/models simultaneously
  • Async batch processing for efficiency
  • Identify optimal judge configuration

📊 Visualization

  • Performance comparison charts
  • Score distribution analysis
  • Disagreement identification

🎛️ Model Configuration

  • Compare different models (GPT-4, GPT-3.5, etc.)
  • Test temperature settings
  • Experiment with custom prompts

Advanced Usage

Compare Different Models

# JudgeConfig and JudgeComparison are assumed importable from the package root;
# adjust the import path to match your installation.
from judgesync import JudgeConfig, JudgeComparison

configs = [
    JudgeConfig(
        name="gpt-4-cold",
        system_prompt="Rate the response quality.",
        deployment_name="gpt-4",
        temperature=0.0,
    ),
    JudgeConfig(
        name="gpt-4-warm",
        system_prompt="Rate the response quality.",
        deployment_name="gpt-4",
        temperature=0.7,
    ),
]

# items: your evaluation items, e.g. tracker.data_loader.items from Quick Start
comparison = JudgeComparison(configs, items)
results = comparison.run_comparison()

Analyze Disagreements

# Find items where judges disagree significantly
disagreements = comparison.get_disagreement_items(results, threshold=1.0)
print(f"Found {len(disagreements)} items with high disagreement")
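Conceptually, the threshold filter keeps items whose judge scores differ by more than the given amount. A stdlib-only sketch of that idea (the function and its inputs below are illustrative, not JudgeSync's actual implementation):

```python
# Illustrative sketch: flag items where two judges' scores differ
# by more than a threshold. Names here are hypothetical.
def find_disagreements(scores_a, scores_b, threshold=1.0):
    """Return indices where |score_a - score_b| > threshold."""
    return [
        i for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if abs(a - b) > threshold
    ]

strict = [5, 2, 4, 1, 3]
lenient = [5, 4, 4, 3, 3]
print(find_disagreements(strict, lenient, threshold=1.0))  # [1, 3]
```

Inspecting these high-disagreement items by hand is often the fastest way to spot where a judge prompt is ambiguous.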

Custom Azure OpenAI Configuration

(Set these before using create_comparison().)

# Option 1: Environment variables (.env file)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4


# Option 2: Direct configuration
judge = Judge(
    system_prompt="Your prompt here",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    deployment_name="gpt-4"
)
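If you take the environment-variable route, a quick stdlib check that the required variables are set (names taken from the snippet above; the helper itself is illustrative) can catch misconfiguration before any API call:

```python
import os

REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]

def check_azure_config(env=os.environ):
    """Return the names of any required Azure OpenAI variables that are unset."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = check_azure_config()
if missing:
    print(f"Missing Azure OpenAI settings: {', '.join(missing)}")
```

Since python-dotenv is a listed requirement, calling `load_dotenv()` before this check will also pick up values from a local `.env` file.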

CSV Format

Your evaluation data should have these columns:

  • question: The input/prompt
  • response: The response to evaluate
  • human_score: The human-assigned score

Example:

question,response,human_score
What is the capital of France?,"Paris is the capital of France.",5
Explain photosynthesis.,"Plants make food from sunlight.",3
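Before loading a file, you can sanity-check it against this format with the standard library alone (column names come from the list above; the validation helper is illustrative):

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "response", "human_score"}

def validate_eval_csv(fileobj):
    """Check required header columns and that human_score parses as a number."""
    reader = csv.DictReader(fileobj)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    rows = list(reader)
    for row in rows:
        float(row["human_score"])  # raises ValueError if not numeric
    return rows

sample = io.StringIO(
    'question,response,human_score\n'
    'What is the capital of France?,"Paris is the capital of France.",5\n'
    'Explain photosynthesis.,"Plants make food from sunlight.",3\n'
)
rows = validate_eval_csv(sample)
print(len(rows))  # 2
```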

Visualization Examples

JudgeSync generates comprehensive comparison charts showing:

  • Cohen's Kappa scores by judge
  • Agreement rates
  • Score distributions
  • Correlation analysis

[Figure: example judge comparison chart]

Best Practices

  1. Start with diverse test data: Include responses across all score ranges
  2. Test multiple prompts: Even small wording changes can impact alignment
  3. Consider temperature: Lower temperatures (0.0-0.3) often provide more consistent scoring
  4. Validate on held-out data: Ensure your calibrated judge generalizes well
  5. Monitor for biases: Check if judges favor certain response styles
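Practice 4 can be as simple as a deterministic random split of your labeled items before calibrating. A stdlib sketch (the item structure and function are illustrative):

```python
import random

def split_holdout(items, holdout_fraction=0.2, seed=42):
    """Shuffle items deterministically and split into (calibration, holdout)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

calibration, holdout = split_holdout(range(100))
print(len(calibration), len(holdout))  # 80 20
```

Pick the judge configuration on the calibration set, then report kappa and agreement only on the holdout set.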

Requirements

  • Python 3.9+
  • Azure OpenAI API access
  • pandas
  • numpy
  • scikit-learn
  • python-dotenv

Contributing

See CONTRIBUTING.md for setup instructions, coding standards, and our preferred workflow.

License

MIT License - See LICENSE file for details.

Citation

If you use JudgeSync in your research, please cite:

@software{judgesync2025,
  title = {JudgeSync: Calibrating LLM Judges with Human Feedback},
  author = {Asher, James},
  year = {2025},
  url = {https://github.com/jasher4994/judgesync}
}

Acknowledgments

Inspired by research on LLM judge calibration and the challenges observed in production LLM evaluation systems.
