A Python package for aligning LLM judges to human preferences

JudgeSync 🧑‍⚖️

JudgeSync is a lightweight Python package for calibrating LLM judges to align with human evaluations. It helps minimize bias and improve reliability in LLM-as-a-judge workflows by comparing different judge configurations and finding the best alignment with human scores.

Why JudgeSync?

LLM judges are powerful but exhibit biases:

  • Verbosity bias: Favoring longer responses, even when they contain errors
  • Position bias: Preferring whichever option is presented first
  • Self-bias: Favoring outputs from models similar to the judge
  • Leniency/strictness bias: Inconsistent scoring severity across items

JudgeSync helps you find the optimal judge configuration (prompt, model, temperature) that best aligns with human judgments.

Installation

pip install judgesync

Quick Start

from judgesync import AlignmentTracker, ScoreRange

# Load your evaluation data with human scores
tracker = AlignmentTracker(score_range=ScoreRange.FIVE_POINT)
tracker.load_human_scores_from_csv("evaluation_data.csv")

# Ensure Azure OpenAI credentials are configured (see section below)
prompt_comparison = tracker.create_comparison()

prompt_comparison.add_judge(
    name="strict",
    system_prompt="You are a strict evaluator. Only give high scores to exceptional responses.",
)

prompt_comparison.add_judge(
    name="balanced",
    system_prompt="You are a balanced evaluator. Consider both strengths and weaknesses fairly.",
)

prompt_comparison.add_judge(
    name="detailed_rubric",
    system_prompt="""Rate responses on a 1-5 scale:
5: Comprehensive, accurate, well-structured
4: Good accuracy, minor gaps
3: Adequate, addresses main points
2: Partially correct, significant gaps
1: Incorrect or irrelevant""",
)

# Run comparison and find the best judge
results = prompt_comparison.run_comparison(tracker.data_loader.items, use_async=True)
print(results)

# Visualize results (requires matplotlib: pip install matplotlib)
prompt_comparison.plot_comparison(results, save_path="judge_comparison.png")

Key Metrics

Cohen's Kappa (κ)

Measures agreement between human and LLM judge, accounting for chance agreement:

  • κ > 0.7: Production-ready alignment
  • 0.4 ≤ κ ≤ 0.7: Moderate alignment; may need further prompt tuning
  • κ < 0.4: Poor alignment; needs improvement

Agreement Rate

Percentage of exact score matches between human and judge.
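Both metrics can be computed directly from paired score lists. A minimal sketch using scikit-learn (a listed requirement); the scores below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative 5-point scores (not real evaluation data)
human = np.array([5, 3, 4, 2, 5, 1])
judge = np.array([5, 3, 3, 2, 4, 1])

kappa = cohen_kappa_score(human, judge)    # chance-corrected agreement
agreement = float(np.mean(human == judge)) # exact-match rate

print(f"kappa={kappa:.2f}, agreement={agreement:.2%}")
```

Note that kappa can be substantially lower than the raw agreement rate when the score distribution is skewed, which is exactly why it is the headline metric here.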

Features

🎯 Score Alignment

  • Support for multiple scoring scales (binary, 5-point, 10-point, percentage)
  • Automatic score range validation
  • Statistical metrics (Cohen's Kappa, correlation, agreement rate)

🔬 Judge Comparison

  • Test multiple prompts/models simultaneously
  • Async batch processing for efficiency
  • Identify optimal judge configuration

📊 Visualization

  • Performance comparison charts
  • Score distribution analysis
  • Disagreement identification

🎛️ Model Configuration

  • Compare different models (GPT-4, GPT-3.5, etc.)
  • Test temperature settings
  • Experiment with custom prompts

Advanced Usage

Compare Different Models

# JudgeConfig and JudgeComparison are assumed importable from the package root;
# adjust the import path to match your installation.
from judgesync import JudgeConfig, JudgeComparison

configs = [
    JudgeConfig(
        name="gpt-4-cold",
        system_prompt="Rate the response quality.",
        deployment_name="gpt-4",
        temperature=0.0,
    ),
    JudgeConfig(
        name="gpt-4-warm",
        system_prompt="Rate the response quality.",
        deployment_name="gpt-4",
        temperature=0.7,
    ),
]

# items: your evaluation items, e.g. tracker.data_loader.items from Quick Start
comparison = JudgeComparison(configs, items)
results = comparison.run_comparison()

Analyze Disagreements

# Find items where judges disagree significantly
disagreements = comparison.get_disagreement_items(results, threshold=1.0)
print(f"Found {len(disagreements)} items with high disagreement")
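Conceptually, the threshold filter keeps items whose judge scores differ by more than the given amount. A stdlib-only sketch of that idea (the function and its inputs below are illustrative, not JudgeSync's actual implementation):

```python
# Illustrative sketch: flag items where two judges' scores differ
# by more than a threshold. Names here are hypothetical.
def find_disagreements(scores_a, scores_b, threshold=1.0):
    """Return indices where |score_a - score_b| > threshold."""
    return [
        i for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if abs(a - b) > threshold
    ]

strict = [5, 2, 4, 1, 3]
lenient = [5, 4, 4, 3, 3]
print(find_disagreements(strict, lenient, threshold=1.0))  # [1, 3]
```

Inspecting these high-disagreement items by hand is often the fastest way to spot where a judge prompt is ambiguous.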

Custom Azure OpenAI Configuration

(Set these before using create_comparison().)

# Option 1: Environment variables (.env file)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4


# Option 2: Direct configuration
judge = Judge(
    system_prompt="Your prompt here",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key",
    deployment_name="gpt-4"
)
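If you take the environment-variable route, a quick stdlib check that the required variables are set (names taken from the snippet above; the helper itself is illustrative) can catch misconfiguration before any API call:

```python
import os

REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]

def check_azure_config(env=os.environ):
    """Return the names of any required Azure OpenAI variables that are unset."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = check_azure_config()
if missing:
    print(f"Missing Azure OpenAI settings: {', '.join(missing)}")
```

Since python-dotenv is a listed requirement, calling `load_dotenv()` before this check will also pick up values from a local `.env` file.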

CSV Format

Your evaluation data should have these columns:

  • question: The input/prompt
  • response: The response to evaluate
  • human_score: The human-assigned score

Example:

question,response,human_score
What is the capital of France?,"Paris is the capital of France.",5
Explain photosynthesis.,"Plants make food from sunlight.",3
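Before loading a file, you can sanity-check it against this format with the standard library alone (column names come from the list above; the validation helper is illustrative):

```python
import csv
import io

REQUIRED_COLUMNS = {"question", "response", "human_score"}

def validate_eval_csv(fileobj):
    """Check required header columns and that human_score parses as a number."""
    reader = csv.DictReader(fileobj)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    rows = list(reader)
    for row in rows:
        float(row["human_score"])  # raises ValueError if not numeric
    return rows

sample = io.StringIO(
    'question,response,human_score\n'
    'What is the capital of France?,"Paris is the capital of France.",5\n'
    'Explain photosynthesis.,"Plants make food from sunlight.",3\n'
)
rows = validate_eval_csv(sample)
print(len(rows))  # 2
```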

Visualization Examples

JudgeSync generates comprehensive comparison charts showing:

  • Cohen's Kappa scores by judge
  • Agreement rates
  • Score distributions
  • Correlation analysis

[Figure: example judge comparison chart]

Best Practices

  1. Start with diverse test data: Include responses across all score ranges
  2. Test multiple prompts: Even small wording changes can impact alignment
  3. Consider temperature: Lower temperatures (0.0-0.3) often provide more consistent scoring
  4. Validate on held-out data: Ensure your calibrated judge generalizes well
  5. Monitor for biases: Check if judges favor certain response styles
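Practice 4 can be as simple as a deterministic random split of your labeled items before calibrating. A stdlib sketch (the item structure and function are illustrative):

```python
import random

def split_holdout(items, holdout_fraction=0.2, seed=42):
    """Shuffle items deterministically and split into (calibration, holdout)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

calibration, holdout = split_holdout(range(100))
print(len(calibration), len(holdout))  # 80 20
```

Pick the judge configuration on the calibration set, then report kappa and agreement only on the holdout set.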

Requirements

  • Python 3.9+
  • Azure OpenAI API access
  • pandas
  • numpy
  • scikit-learn
  • python-dotenv

Contributing

See CONTRIBUTING.md for setup instructions, coding standards, and our preferred workflow.

License

MIT License - See LICENSE file for details.

Citation

If you use JudgeSync in your research, please cite:

@software{judgesync2025,
  title = {JudgeSync: Calibrating LLM Judges with Human Feedback},
  author = {Asher, James},
  year = {2025},
  url = {https://github.com/jasher4994/judgesync}
}

Acknowledgments

Inspired by research on LLM judge calibration and the challenges observed in production LLM evaluation systems.
