SLMJury: Can Small Language Models Judge as Well as Large Language Models?
16 SLM Judges • 10 Datasets • 3 Advanced Strategies • 6 Persona Prompts
What is SLMJury?
SLMJury is a comprehensive framework that investigates whether Small Language Models (0.6B–14B parameters) can serve as reliable judges across both closed-ended (accuracy-based) and open-ended (correlation-based) evaluation paradigms. The project explores six evaluation modes: individual judging, persona-based evaluation, majority-vote ensembles, multi-agent debate, human agreement scoring (SummEval), and LLM agreement scoring (MT-Bench).
Key Highlights
Individual Judging • Majority Voting • Multi-Agent Debate
Installation
From Source (Recommended)
git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e .
Quick Start
CLI Scripts
# Step 1: Run student model inference
python scripts/run_student.py --model qwen2.5-32b --datasets gsm8k math
# Step 2: Run judge evaluations
python scripts/run_judge.py --judge qwen3-4b --max-tokens 10 8192
# Step 3: Evaluate all judgements and generate summaries
python scripts/run_evaluation.py
Python API
from slmjury.core.solver import StudentSolver
from slmjury.core.judge import JudgeModel
from slmjury.core.evaluator import JudgeEvaluator
# Step 1: Solve problems with a student model
# (`problems` is assumed to hold the GSM8K problems, loaded beforehand)
solver = StudentSolver("qwen2.5-32b")
results = solver.solve_batch(problems, "gsm8k")
solver.save_results(results, "gsm8k")
solver.cleanup()
# Step 2: Judge the solutions
judge = JudgeModel("qwen3-4b")
judgements = judge.evaluate_batch(results, max_tokens=10)
judge.save_results(judgements, "qwen2.5-32b", "gsm8k", 10)
judge.cleanup()
# Step 3: Evaluate judge accuracy
evaluator = JudgeEvaluator("qwen3-4b", "qwen2.5-32b", "gsm8k", 10, judgements)
summary = evaluator.evaluate()
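Step 3 measures how often the judge's verdict agrees with whether the student's answer was actually right. The sketch below illustrates that comparison on its own, without the SLMJury API; the function name `judge_accuracy` and the boolean input layout are assumptions made for illustration, not the package's internals.

```python
def judge_accuracy(verdicts, student_is_correct):
    """Fraction of items where the judge's verdict matches ground truth.

    verdicts: list of bools, True when the judge said "Correct".
    student_is_correct: list of bools, True when the student's answer
    actually matched the reference answer.
    """
    if len(verdicts) != len(student_is_correct):
        raise ValueError("inputs must have the same length")
    if not verdicts:
        raise ValueError("no judgements to evaluate")
    # The judge is right when its verdict equals the ground-truth label.
    agree = sum(v == s for v, s in zip(verdicts, student_is_correct))
    return agree / len(verdicts)


# Hypothetical toy data: the judge is wrong on the third item only.
verdicts = [True, False, True, True]
truth = [True, False, False, True]
print(judge_accuracy(verdicts, truth))  # 0.75
```

Note that a judge is rewarded both for accepting right answers and for rejecting wrong ones, so a judge that always says "Correct" scores only as high as the student's own accuracy.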
Advanced: Multi-Agent Strategies
# Majority voting ensemble
from slmjury.strategies.ensemble import run_majority_voting
run_majority_voting(
judge_keys=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
student_results=results,
max_tokens=10,
)
# Multi-agent debate (3 judges, RCR prompting)
from slmjury.strategies.debate import run_debate
run_debate(
combo_models=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
combo_temps=[0, 0, 0],
student_results=results,
dataset_name="gsm8k",
)
# Persona effects (6 system prompts × all judges)
from slmjury.strategies.persona import run_persona_evaluation
run_persona_evaluation("qwen3-4b", results, max_tokens=10)
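For intuition, majority voting over judge verdicts can be sketched in a few lines. This is a standalone illustration rather than the body of `run_majority_voting`; the helper name `majority_verdict` and the choice to return `None` on a tie are assumptions.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Return the most common verdict among the judges, or None on a tie."""
    counts = Counter(verdicts).most_common()
    if not counts:
        raise ValueError("need at least one verdict")
    # A tie between the two leading verdicts means there is no majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical per-judge verdicts for a single student solution.
per_judge = {
    "qwen3-4b": "Correct",
    "phi4mi-3.8b": "Incorrect",
    "qwen2.5-3b": "Correct",
}
print(majority_verdict(list(per_judge.values())))  # Correct
```

With an odd number of judges and two possible verdicts a tie cannot occur; the `None` branch only matters for even-sized ensembles or when some verdict could not be parsed.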
Open-Ended Scoring (SummEval / MT-Bench)
# Score SummEval with a single judge
python scripts/run_scoring_judge.py \
--judge qwen3-4b --dataset summeval
# Score MT-Bench with a single judge
python scripts/run_scoring_judge.py \
--judge qwen3-4b --dataset mtbench \
--oracle-scores results/mtbench_oracle/
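# The same scoring through the Python API
from slmjury.core.scoring_judge import ScoringJudge
judge = ScoringJudge("qwen3-4b", output_dir="results/scoring")
# Score SummEval (4-dimension scoring)
# load_dataset is not imported in this snippet; the dataset loaders live under slmjury/data/
summeval_data = load_dataset("summeval")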
results = judge.score_summeval(summeval_data, max_tokens=8192)
judge.save_results(results, "summeval")
judge.cleanup()
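Open-ended evaluation is correlation-based: judge scores are compared with oracle scores (human annotations for SummEval, large-model scores for MT-Bench). The README does not say which correlation statistic is used, so the sketch below assumes Spearman rank correlation and implements it in plain Python; all function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 1-5 scores for four summaries.
judge_scores = [4, 2, 5, 3]
human_scores = [5, 1, 4, 3]
print(round(spearman(judge_scores, human_scores), 3))  # 0.8
```

Rank correlation only asks whether the judge orders items the way the oracle does, so a judge that is consistently one point too generous is not penalized.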
Supported Models
| Family | Models | Parameters | Thinking |
|---|---|---|---|
| Qwen 2.5 | 1.5B, 3B, 7B | 1.5B – 7B | No |
| Qwen 3 | 0.6B, 1.7B, 4B, 8B, 14B | 0.6B – 14B | Yes |
| Llama 3.x | 3.2-1B, 3.2-3B, 3.1-8B | 1B – 8B | No |
| Phi-4 | 14B, Reasoning, R-Plus, Mini, Mini-Reasoning | 3.8B – 14B | Yes * |
*Phi-4 Reasoning/Plus/Mini-Reasoning always use thinking mode and skip quick verdict (t=10) evaluation.
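The footnote implies that the list of token budgets depends on the judge: always-thinking models run only the long budget, while other judges run both the quick verdict (t=10) and the long one. A minimal sketch of that selection follows; the model keys in `ALWAYS_THINKING` and the function name `token_budgets` are hypothetical and may not match the keys in the package's YAML configs.

```python
# Hypothetical keys for the Phi-4 variants that always emit reasoning.
ALWAYS_THINKING = {
    "phi4-reasoning",
    "phi4-reasoning-plus",
    "phi4-mini-reasoning",
}


def token_budgets(judge_key, requested=(10, 8192), quick_budget=10):
    """Return the max-token budgets to run for a given judge.

    Always-thinking judges skip the quick-verdict budget, because a
    10-token limit would cut off their reasoning before any verdict.
    """
    if judge_key in ALWAYS_THINKING:
        return [t for t in requested if t != quick_budget]
    return list(requested)


print(token_budgets("qwen3-4b"))        # [10, 8192]
print(token_budgets("phi4-reasoning"))  # [8192]
```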
Datasets
Closed-ended (verdict: Correct/Incorrect):
| Dataset | Type | Domain | Size |
|---|---|---|---|
| GSM8K | Numeric | Math | 1,319 |
| GSM-Plus | Numeric | Math | 10,552 |
| MATH | LaTeX | Math | 5,000 |
| ARC-Easy | Multiple Choice | Science | 2,376 |
| ARC-Challenge | Multiple Choice | Science | 1,172 |
| HellaSwag | Multiple Choice | General | 10,042 |
| WinoGrande | Multiple Choice | General | 1,267 |
| TruthfulQA | Multiple Choice | General | 684 |
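For the Numeric datasets, deciding whether a student answer is right requires normalizing both sides first, since "1,319" and "1319.0" denote the same value. The sketch below shows one way to do that with the standard library; it illustrates the idea only and is not the logic in `slmjury/parsers/`, and both function names are assumptions.

```python
from decimal import Decimal, InvalidOperation


def normalize_numeric(text):
    """Parse strings like '1,319', '$42.0', or ' 7 ' into a Decimal.

    Returns None when the text is not a plain number.
    """
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def numeric_match(predicted, reference):
    """True when both strings parse and denote the same number."""
    p = normalize_numeric(predicted)
    r = normalize_numeric(reference)
    return p is not None and r is not None and p == r


print(numeric_match("1,319", "1319.0"))  # True
print(numeric_match("seven", "7"))       # False
```

`Decimal` is used instead of `float` so that values such as `0.1` compare exactly rather than through binary rounding.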
Open-ended (scoring: 1–5):
| Dataset | Type | Turns | Size | Oracle |
|---|---|---|---|---|
| SummEval | Summarization | – | 1,600 pairs | Human annotations |
| MT-Bench | Multi-turn chat | 2 | 80 questions | GPT-OSS-120B, Qwen3.5-397B (Together API) |
Project Structure
SLMJury/
├── slmjury/ # Python package
│   ├── configs/ # Centralized YAML model configurations
│   ├── data/ # Dataset loaders (HuggingFace → local JSON)
│   ├── parsers/ # Answer extraction, normalization, verdict/score parsing
│   ├── core/ # Pipeline: solver → judge → evaluator + scoring
│   └── strategies/ # Ensemble voting, multi-agent debate, personas
├── scripts/ # CLI entry-points (student, judge, oracle, scoring)
├── bash/ # Bash wrappers for full experiment runs
├── tests/ # Unit & integration tests (pytest)
├── website/ # React leaderboard (Vite + Tailwind)
├── assets/ # SVG banner and logo
├── pyproject.toml # Package config (pip install -e .)
└── README.md
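Verdict parsing, listed under `parsers/`, has to turn free-form judge output into a Correct/Incorrect label. A minimal sketch of one possible approach follows; the regular expression and the rule of taking the last verdict word are assumptions for illustration, not the package's actual parser.

```python
import re

# Word boundaries keep "correct" from matching inside "incorrect".
_VERDICT_RE = re.compile(r"\b(incorrect|correct)\b", re.IGNORECASE)


def parse_verdict(judge_output):
    """Return 'Correct', 'Incorrect', or None if no verdict word is found.

    The last verdict word wins, on the assumption that a judge which
    reasons first states its final decision at the end.
    """
    matches = _VERDICT_RE.findall(judge_output)
    if not matches:
        return None
    return "Correct" if matches[-1].lower() == "correct" else "Incorrect"


print(parse_verdict("Verdict: correct"))  # Correct
print(parse_verdict("It looks correct at first, but the final answer is incorrect."))  # Incorrect
print(parse_verdict("I am not sure."))  # None
```

Returning `None` for unparseable output lets the caller count such cases separately instead of silently treating them as one of the two verdicts.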
Leaderboard
Explore full results on the interactive leaderboard:
Citation
If you use SLMJury in your research, please cite:
@misc{laddha2026slmjury,
title={SLMJury: Can Small Language Models Judge as Well as Large Language Models?},
author={Anish Laddha and Nitesh Pradhan and Gaurav Srivastava},
year={2026},
}
License
Apache License 2.0. See LICENSE for details.
Made with ❤️ by Anish Laddha, Nitesh Pradhan, and Gaurav Srivastava