A Python package for aligning LLM judges to human preferences
JudgeSync 🧑‍⚖️
JudgeSync is a lightweight Python package for calibrating LLM judges to align with human evaluations. It helps minimize bias and improve reliability in LLM-as-a-judge workflows by comparing different judge configurations and finding the best alignment with human scores.
Why JudgeSync?
LLM judges are powerful but prone to systematic biases:
- Verbosity bias: Favoring longer responses even when they contain errors
- Position bias: Preferring the first option presented
- Self-bias: Favoring outputs from models similar to themselves
- Leniency/strictness bias: Inconsistent scoring patterns across items
JudgeSync helps you find the optimal judge configuration (prompt, model, temperature) that best aligns with human judgments.
Installation
pip install judgesync
Quick Start
from judgesync import AlignmentTracker, ScoreRange
# Load your evaluation data with human scores
tracker = AlignmentTracker(score_range=ScoreRange.FIVE_POINT)
tracker.load_human_scores_from_csv("evaluation_data.csv")
# Ensure Azure OpenAI credentials are configured (see section below)
prompt_comparison = tracker.create_comparison()
prompt_comparison.add_judge(
    name="strict",
    system_prompt="You are a strict evaluator. Only give high scores to exceptional responses.",
)
prompt_comparison.add_judge(
    name="balanced",
    system_prompt="You are a balanced evaluator. Consider both strengths and weaknesses fairly.",
)
prompt_comparison.add_judge(
    name="detailed_rubric",
    system_prompt="""Rate responses on a 1-5 scale:
5: Comprehensive, accurate, well-structured
4: Good accuracy, minor gaps
3: Adequate, addresses main points
2: Partially correct, significant gaps
1: Incorrect or irrelevant""",
)
# Run comparison and find the best judge
results = prompt_comparison.run_comparison(tracker.data_loader.items, use_async=True)
print(results)
# Visualize results (requires matplotlib: pip install matplotlib)
prompt_comparison.plot_comparison(results, save_path="judge_comparison.png")
Key Metrics
Cohen's Kappa (κ)
Measures agreement between human and LLM judge, accounting for chance agreement:
- κ > 0.7: Production-ready alignment
- κ = 0.4-0.7: Good alignment, may need fine-tuning
- κ < 0.4: Poor alignment, needs improvement
Agreement Rate
The percentage of items on which the human score and the judge score match exactly.
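Both metrics can be reproduced directly with scikit-learn (a listed dependency). The scores below are made-up illustrative values:

```python
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 4, 3, 2, 1, 3]  # hypothetical human ratings
judge_scores = [5, 4, 3, 3, 1, 3]  # hypothetical LLM judge ratings

# Kappa corrects the raw agreement rate for agreement expected by chance.
kappa = cohen_kappa_score(human_scores, judge_scores)
agreement = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)
print(f"kappa={kappa:.3f}, agreement={agreement:.1%}")
```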
Features
🎯 Score Alignment
- Support for multiple scoring scales (binary, 5-point, 10-point, percentage)
- Automatic score range validation
- Statistical metrics (Cohen's Kappa, correlation, agreement rate)
🔬 Judge Comparison
- Test multiple prompts/models simultaneously
- Async batch processing for efficiency
- Identify optimal judge configuration
📊 Visualization
- Performance comparison charts
- Score distribution analysis
- Disagreement identification
🎛️ Model Configuration
- Compare different models (GPT-4, GPT-3.5, etc.)
- Test temperature settings
- Experiment with custom prompts
Advanced Usage
Compare Different Models
from judgesync import JudgeComparison, JudgeConfig

configs = [
    JudgeConfig(
        name="gpt-4-cold",
        system_prompt="Rate the response quality.",
        deployment_name="gpt-4",
        temperature=0.0,
    ),
    JudgeConfig(
        name="gpt-4-warm",
        system_prompt="Rate the response quality.",
        deployment_name="gpt-4",
        temperature=0.7,
    ),
]
comparison = JudgeComparison(configs, items)  # items: evaluation items with human scores
results = comparison.run_comparison()
Analyze Disagreements
# Find items where judges disagree significantly
disagreements = comparison.get_disagreement_items(results, threshold=1.0)
print(f"Found {len(disagreements)} items with high disagreement")
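For intuition, a threshold of 1.0 flags any item whose absolute human-judge gap is at least one point. This is a hypothetical sketch of the idea, not the library's internals:

```python
# Hypothetical per-item scores; keys are item ids.
human = {"q1": 5, "q2": 3, "q3": 2}
judge = {"q1": 5, "q2": 1, "q3": 3}

threshold = 1.0
# An item counts as a disagreement when |human - judge| >= threshold.
disagreements = [item for item in human if abs(human[item] - judge[item]) >= threshold]
print(disagreements)  # q2 differs by 2 points, q3 by 1
```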
Custom Azure OpenAI Configuration
(Set these before using create_comparison().)
# Option 1: Environment variables (.env file)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4
# Option 2: Direct configuration
judge = Judge(
    system_prompt="Your prompt here",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    deployment_name="gpt-4",
)
CSV Format
Your evaluation data should have these columns:
- question: The input/prompt
- response: The response to evaluate
- human_score: The human-assigned score
question,response,human_score
What is the capital of France?,"Paris is the capital of France.",5
Explain photosynthesis.,"Plants make food from sunlight.",3
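Before loading, you can sanity-check a file's columns with pandas (a listed dependency). This sketch parses the example above in-memory:

```python
from io import StringIO

import pandas as pd

csv_text = """question,response,human_score
What is the capital of France?,"Paris is the capital of France.",5
Explain photosynthesis.,"Plants make food from sunlight.",3
"""

df = pd.read_csv(StringIO(csv_text))

# Fail fast if any required column is absent.
missing_cols = {"question", "response", "human_score"} - set(df.columns)
assert not missing_cols, f"CSV is missing columns: {missing_cols}"
```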
Visualization Examples
JudgeSync generates comprehensive comparison charts showing:
- Cohen's Kappa scores by judge
- Agreement rates
- Score distributions
- Correlation analysis
Best Practices
- Start with diverse test data: Include responses across all score ranges
- Test multiple prompts: Even small wording changes can impact alignment
- Consider temperature: Lower temperatures (0.0-0.3) often provide more consistent scoring
- Validate on held-out data: Ensure your calibrated judge generalizes well
- Monitor for biases: Check if judges favor certain response styles
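The held-out validation step can be sketched with scikit-learn's train_test_split (hypothetical; assumes your evaluation items are in a list):

```python
from sklearn.model_selection import train_test_split

items = [f"item_{i}" for i in range(100)]  # placeholder for scored evaluation items

# Calibrate judge prompts on 80% of the data, then confirm alignment on the rest.
calibration_items, holdout_items = train_test_split(items, test_size=0.2, random_state=42)
print(len(calibration_items), len(holdout_items))
```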
Requirements
- Python 3.9+
- Azure OpenAI API access
- pandas
- numpy
- scikit-learn
- python-dotenv
Contributing
See CONTRIBUTING.md for setup instructions, coding standards, and our preferred workflow.
License
MIT License - See LICENSE file for details.
Citation
If you use JudgeSync in your research, please cite:
@software{judgesync2025,
  title  = {JudgeSync: Calibrating LLM Judges with Human Feedback},
  author = {Asher, James},
  year   = {2025},
  url    = {https://github.com/jasher4994/judgesync}
}
Acknowledgments
Inspired by research on LLM judge calibration and the challenges observed in production LLM evaluation systems.