SLMJury: Can Small Language Models Judge as Well as Large Language Models?

Can Small Language Models Judge as Well as Large Language Models?

🧑‍⚖️ 16 SLM Judges • 📊 10 Datasets • 🗳️ 3 Advanced Strategies • 🎭 6 Persona Prompts

🏆 Leaderboard | 🚀 Get Started


💡 What is SLMJury?

SLMJury is a comprehensive framework that investigates whether Small Language Models (0.6B–14B parameters) can serve as reliable judges across both closed-ended (accuracy-based) and open-ended (correlation-based) evaluation paradigms. The project explores six evaluation modes: individual judging, persona-based evaluation, majority-vote ensembles, multi-agent debate, human agreement scoring (SummEval), and LLM agreement scoring (MT-Bench).

🌟 Key Highlights

🧠 Individual Judging

  • 16 SLM judges from 4 model families
  • Quick verdict vs. reasoned response
  • Accuracy & Instruction Following Rate
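As a rough illustration of the two metrics above, the sketch below computes accuracy and Instruction Following Rate (IFR) from raw judge outputs. It assumes IFR is the fraction of outputs from which a Correct/Incorrect verdict can be parsed, and that unparseable outputs count as wrong. `parse_verdict` and `score_judge` are hypothetical helpers for this example, not part of the slmjury package.

```python
import re
from typing import Optional

# Word boundaries keep "incorrect" from being read as "correct".
VERDICT_RE = re.compile(r"\b(incorrect|correct)\b", re.IGNORECASE)


def parse_verdict(text: str) -> Optional[bool]:
    """Return True for 'Correct', False for 'Incorrect', None if no verdict is found."""
    match = VERDICT_RE.search(text)
    if match is None:
        return None
    return match.group(1).lower() == "correct"


def score_judge(outputs: list[str], gold: list[bool]) -> dict[str, float]:
    """Compute accuracy and IFR (assumed definitions, see the note above).

    gold[i] is True when the student's answer really was correct.
    """
    if not outputs or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and the same length")
    verdicts = [parse_verdict(text) for text in outputs]
    followed = sum(v is not None for v in verdicts)
    hits = sum(v is not None and v == g for v, g in zip(verdicts, gold))
    return {"accuracy": hits / len(outputs), "ifr": followed / len(outputs)}


outputs = ["Correct", "Verdict: Incorrect", "I think maybe", "incorrect"]
gold = [True, False, True, True]
print(score_judge(outputs, gold))
```

Counting unparseable outputs as wrong keeps accuracy comparable across judges with different IFRs; a stricter or looser policy is equally plausible.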

🗳️ Majority Voting

  • C(5,3) ensemble combinations
  • Top-5 best individual judges
  • Boosted accuracy via consensus
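To make the C(5,3) figure concrete: choosing 3 judges out of 5 gives 10 distinct ensembles. The sketch below enumerates them and scores each by majority vote. The judge names, verdicts, and the `ensemble_accuracies` helper are made-up illustrations, not the package's `run_majority_voting` implementation.

```python
from itertools import combinations


def majority_vote(votes):
    """Return True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def ensemble_accuracies(judge_verdicts, gold, k=3):
    """Accuracy of the majority vote for every k-judge combination."""
    accuracies = {}
    for combo in combinations(sorted(judge_verdicts), k):
        # Transpose so that each tuple holds the k votes for one item.
        per_item_votes = zip(*(judge_verdicts[name] for name in combo))
        predictions = [majority_vote(votes) for votes in per_item_votes]
        correct = sum(p == g for p, g in zip(predictions, gold))
        accuracies[combo] = correct / len(gold)
    return accuracies


# Toy data: True means "the judge says the student's answer is correct".
gold = [True, False, True, False]
judge_verdicts = {
    "judge_a": [True, False, True, False],
    "judge_b": [True, False, False, False],
    "judge_c": [True, True, True, False],
    "judge_d": [False, False, True, True],
    "judge_e": [True, False, True, True],
}

accuracies = ensemble_accuracies(judge_verdicts, gold)
print(len(accuracies))  # 10 ensembles, i.e. C(5,3)
best = max(accuracies, key=accuracies.get)
print(best, accuracies[best])
```

With an odd ensemble size and binary verdicts, ties cannot occur, which is one reason 3-judge ensembles are convenient.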

🤝 Multi-Agent Debate

  • RCR (Reflect-Critique-Refine) prompting
  • Cross-architecture & same-model variants
  • Up to 5 rounds with consensus fallback
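The round structure above can be sketched as a simple loop. This is only a sketch under stated assumptions: "consensus" is taken to mean a unanimous verdict, the fallback after the last round is assumed to be a majority vote, and each judge is modeled as a plain function rather than a model driven by RCR prompts. None of these names come from the slmjury package.

```python
from typing import Callable, List, Sequence, Tuple

# A judge takes the solution text and its peers' previous verdicts
# and returns its own verdict (True = "Correct").
Judge = Callable[[str, Sequence[bool]], bool]


def debate(solution: str, judges: List[Judge], max_rounds: int = 5) -> Tuple[bool, int]:
    """Run rounds until all judges agree or max_rounds is reached.

    Returns (final_verdict, rounds_used). If no unanimous verdict emerges,
    the final verdict falls back to a majority vote (an assumption).
    """
    verdicts = [judge(solution, []) for judge in judges]  # round 1: independent
    rounds = 1
    while len(set(verdicts)) > 1 and rounds < max_rounds:
        previous = list(verdicts)
        # Each judge sees the other judges' verdicts from the previous round.
        verdicts = [
            judge(solution, previous[:i] + previous[i + 1:])
            for i, judge in enumerate(judges)
        ]
        rounds += 1
    return sum(verdicts) * 2 > len(verdicts), rounds


# Toy judges used only to exercise the loop.
def always_true(solution, peers):
    return True


def always_false(solution, peers):
    return False


def follower(solution, peers):
    # Starts out saying "Incorrect", then adopts the peers' majority view.
    return bool(peers) and sum(peers) * 2 > len(peers)


print(debate("2 + 2 = 4", [always_true, always_true, follower]))
print(debate("2 + 2 = 4", [always_true, always_false, always_true]))
```

In the first call the follower switches sides in round 2 and the debate stops early; in the second the judges never agree, so all 5 rounds run and the majority decides.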

⚡ Installation

🔧 From Source (Recommended)

git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e .

🚀 Quick Start

💻 CLI Scripts

# Step 1: Run student model inference
python scripts/run_student.py --model qwen2.5-32b --datasets gsm8k math

# Step 2: Run judge evaluations
python scripts/run_judge.py --judge qwen3-4b --max-tokens 10 8192

# Step 3: Evaluate all judgements and generate summaries
python scripts/run_evaluation.py

🐍 Python API

from slmjury.core.solver import StudentSolver
from slmjury.core.judge import JudgeModel
from slmjury.core.evaluator import JudgeEvaluator

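# Step 1: Solve problems with a student model
# `problems` is assumed to be an already-loaded list of dataset items
# (for example, produced by the loaders in slmjury/data/).
solver = StudentSolver("qwen2.5-32b")
results = solver.solve_batch(problems, "gsm8k")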
solver.save_results(results, "gsm8k")
solver.cleanup()

# Step 2: Judge the solutions
judge = JudgeModel("qwen3-4b")
judgements = judge.evaluate_batch(results, max_tokens=10)
judge.save_results(judgements, "qwen2.5-32b", "gsm8k", 10)
judge.cleanup()

# Step 3: Evaluate judge accuracy
evaluator = JudgeEvaluator("qwen3-4b", "qwen2.5-32b", "gsm8k", 10, judgements)
summary = evaluator.evaluate()

🧩 Advanced: Multi-Agent Strategies

# Majority voting ensemble
from slmjury.strategies.ensemble import run_majority_voting
run_majority_voting(
    judge_keys=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
    student_results=results,
    max_tokens=10,
)

# Multi-agent debate (3 judges, RCR prompting)
from slmjury.strategies.debate import run_debate
run_debate(
    combo_models=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
    combo_temps=[0, 0, 0],
    student_results=results,
    dataset_name="gsm8k",
)

# Persona effects (6 system prompts × all judges)
from slmjury.strategies.persona import run_persona_evaluation
run_persona_evaluation("qwen3-4b", results, max_tokens=10)

🔬 Open-Ended Scoring (SummEval / MT-Bench)

# Score SummEval with a single judge
python scripts/run_scoring_judge.py \
  --judge qwen3-4b --dataset summeval

# Score MT-Bench with a single judge
python scripts/run_scoring_judge.py \
  --judge qwen3-4b --dataset mtbench \
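  --oracle-scores results/mtbench_oracle/

from slmjury.core.scoring_judge import ScoringJudge

judge = ScoringJudge("qwen3-4b", output_dir="results/scoring")

# Score SummEval (4-dimension scoring)
# NOTE: `load_dataset` is not imported in this snippet; import it from the
# project's dataset loaders (see slmjury/data/) before running.
summeval_data = load_dataset("summeval")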
results = judge.score_summeval(summeval_data, max_tokens=8192)
judge.save_results(results, "summeval")
judge.cleanup()

🤖 Supported Models

Family     Models                                        Parameters  Thinking
Qwen 2.5   1.5B, 3B, 7B                                  1.5B – 7B   -
Qwen 3     0.6B, 1.7B, 4B, 8B, 14B                       0.6B – 14B  ✅
Llama 3.x  3.2-1B, 3.2-3B, 3.1-8B                        1B – 8B     -
Phi-4      14B, Reasoning, R-Plus, Mini, Mini-Reasoning  3.8B – 14B  ✅*

*Phi-4 Reasoning, R-Plus, and Mini-Reasoning always use thinking mode and skip the quick-verdict (t=10) evaluation.

📊 Datasets

Closed-ended (verdict: Correct/Incorrect):

Dataset        Type             Domain   Size
GSM8K          Numeric          Math     1,319
GSM-Plus       Numeric          Math     10,552
MATH           LaTeX            Math     5,000
ARC-Easy       Multiple Choice  Science  2,376
ARC-Challenge  Multiple Choice  Science  1,172
HellaSwag      Multiple Choice  General  10,042
WinoGrande     Multiple Choice  General  1,267
TruthfulQA     Multiple Choice  General  684

Open-ended (scoring: 1–5):

Dataset   Type             Turns  Size          Oracle
SummEval  Summarization    -      1,600 pairs   Human annotations
MT-Bench  Multi-turn chat  2      80 questions  GPT-OSS-120B, Qwen3.5-397B (Together API)
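For the open-ended datasets, a judge is compared against oracle scores by correlation rather than accuracy. The sketch below computes a Spearman rank correlation between judge scores and oracle scores on the 1–5 scale. Spearman is an assumption made for illustration, since the specific correlation measure is not stated here, and the scores are invented toy values.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy scores for six items (not real SummEval or MT-Bench data).
judge_scores = [5, 3, 4, 1, 2, 4]
oracle_scores = [4, 3, 5, 1, 2, 4]
print(round(spearman(judge_scores, oracle_scores), 3))
```

Rank correlation only rewards getting the ordering right, so a judge that is systematically one point too harsh can still score well.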

🏗️ Project Structure

SLMJury/
├── slmjury/                  # Python package
│   ├── configs/              # Centralized YAML model configurations
│   ├── data/                 # Dataset loaders (HuggingFace → local JSON)
│   ├── parsers/              # Answer extraction, normalization, verdict/score parsing
│   ├── core/                 # Pipeline: solver → judge → evaluator + scoring
│   └── strategies/           # Ensemble voting, multi-agent debate, personas
├── scripts/                  # CLI entry-points (student, judge, oracle, scoring)
├── bash/                     # Bash wrappers for full experiment runs
├── tests/                    # Unit & integration tests (pytest)
├── website/                  # React leaderboard (Vite + Tailwind)
├── assets/                   # SVG banner and logo
├── pyproject.toml            # Package config (pip install -e .)
└── README.md

🏆 Leaderboard

Explore full results on the interactive leaderboard:


📖 Citation

If you use SLMJury in your research, please cite:

@misc{laddha2026slmjury,
      title={SLMJury: Can Small Language Models Judge as Well as Large Language Models?},
      author={Anish Laddha and Nitesh Pradhan and Gaurav Srivastava},
      year={2026},
}

📄 License

Apache License 2.0. See LICENSE for details.


Made with ❤️ by Anish Laddha, Nitesh Pradhan, and Gaurav Srivastava
