slmjury

SLMJury: Can Small Language Models Judge as Well as Large Language Models?



🧑‍⚖️ 16 SLM Judges • 📊 10 Datasets • 🗳️ 3 Advanced Strategies • 🎭 6 Persona Prompts


💡 What is SLMJury?

SLMJury is a comprehensive framework that investigates whether Small Language Models (0.6B–14B parameters) can serve as reliable judges across both closed-ended (accuracy-based) and open-ended (correlation-based) evaluation paradigms. The project explores six evaluation modes: individual judging, persona-based evaluation, majority-vote ensembles, multi-agent debate, human agreement scoring (SummEval), and LLM agreement scoring (MT-Bench).

🌟 Key Highlights

🧠 Individual Judging

16 SLM judges from 4 model families
Quick verdict vs. reasoned response
Accuracy & Instruction Following Rate
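
The two metrics above can be sketched as follows. This is an illustration only, not SLMJury's actual parser or evaluator: the names `parse_verdict` and `score_judge`, the regular expression, and the definition of Instruction Following Rate used here (the share of judge outputs that contain a parseable Correct/Incorrect verdict) are all assumptions.

```python
import re

# Hypothetical verdict pattern; the word boundaries keep "correct" from
# matching inside "incorrect".
_VERDICT = re.compile(r"\b(incorrect|correct)\b", re.IGNORECASE)


def parse_verdict(text):
    """Return True for 'Correct', False for 'Incorrect', None if no verdict is found."""
    matches = _VERDICT.findall(text)
    if not matches:
        return None
    # Use the last verdict mentioned, since a reasoned response usually ends with it.
    return matches[-1].lower() == "correct"


def score_judge(outputs, gold):
    """Compute accuracy and instruction-following rate (IFR) for one judge.

    outputs: raw judge responses.
    gold: True where the student's answer really is correct.
    Unparseable responses count as wrong for accuracy (an assumption).
    """
    verdicts = [parse_verdict(o) for o in outputs]
    followed = sum(v is not None for v in verdicts)
    hits = sum(v is not None and v == g for v, g in zip(verdicts, gold))
    n = len(outputs)
    return {"accuracy": hits / n, "ifr": followed / n}
```

Reporting the two numbers separately makes it possible to tell a judge that reasons badly from one that simply fails to produce a verdict within its token budget.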

🗳️ Majority Voting

C(5,3) ensemble combinations
Top-5 best individual judges
Boosted accuracy via consensus
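
As a rough illustration of the ensemble setup above, the sketch below enumerates the C(5,3) = 10 three-judge combinations and applies a simple majority vote. The judge keys and the helper names `majority_vote` and `ensemble_accuracy` are placeholders, not part of the SLMJury API.

```python
from collections import Counter
from itertools import combinations

# Placeholder keys standing in for the top-5 individual judges.
top5 = ["judge_a", "judge_b", "judge_c", "judge_d", "judge_e"]

# C(5,3) = 10 three-judge ensembles.
ensembles = list(combinations(top5, 3))


def majority_vote(verdicts):
    """Return the most common verdict; three binary votes can never tie."""
    return Counter(verdicts).most_common(1)[0][0]


def ensemble_accuracy(per_judge_verdicts, gold, members):
    """Accuracy of a majority-vote ensemble over the given member judges.

    per_judge_verdicts maps a judge key to its list of verdicts, aligned with gold.
    """
    correct = 0
    for i, label in enumerate(gold):
        votes = [per_judge_verdicts[m][i] for m in members]
        correct += majority_vote(votes) == label
    return correct / len(gold)
```

An odd ensemble size keeps the vote decisive as long as every judge returns one of two verdicts.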

🤝 Multi-Agent Debate

RCR (Reflect-Critique-Refine) prompting
Cross-architecture & same-model variants
Up to 5 rounds with consensus fallback
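
The debate loop described above might look roughly like this. It is a sketch under stated assumptions: judges are modelled as plain callables, the fallback when no consensus is reached is taken to be a majority vote, and `run_debate_sketch` is a hypothetical name unrelated to the package's `run_debate`.

```python
from collections import Counter


def run_debate_sketch(judges, problem, max_rounds=5):
    """Debate until all judges agree, or fall back to a majority vote.

    judges: callables taking (problem, peer_verdicts) and returning a verdict.
    peer_verdicts is None in the first round and the list of all verdicts
    from the previous round afterwards.
    Returns (final_verdict, rounds_used).
    """
    verdicts = [judge(problem, None) for judge in judges]
    for round_no in range(1, max_rounds + 1):
        if len(set(verdicts)) == 1:
            # Consensus reached: stop early.
            return verdicts[0], round_no
        if round_no == max_rounds:
            break
        # Each judge sees the previous round's verdicts and may revise its own
        # (with RCR prompting: reflect, critique the peers, then refine).
        verdicts = [judge(problem, list(verdicts)) for judge in judges]
    # No consensus within max_rounds: fall back to the majority verdict.
    return Counter(verdicts).most_common(1)[0][0], max_rounds
```

Capping the number of rounds bounds the inference cost, and the fallback guarantees that every problem still receives a verdict.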

⚡ Installation

🔧 From Source (Recommended)

git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e .

🚀 Quick Start

💻 CLI Scripts

# Step 1: Run student model inference
python scripts/run_student.py --model qwen2.5-32b --datasets gsm8k math

# Step 2: Run judge evaluations
python scripts/run_judge.py --judge qwen3-4b --max-tokens 10 8192

# Step 3: Evaluate all judgements and generate summaries
python scripts/run_evaluation.py

🐍 Python API

from slmjury.core.solver import StudentSolver
from slmjury.core.judge import JudgeModel
from slmjury.core.evaluator import JudgeEvaluator

# Step 1: Solve problems with a student model
# `problems` is the list of loaded dataset items (not defined in this snippet)
solver = StudentSolver("qwen2.5-32b")
results = solver.solve_batch(problems, "gsm8k")
solver.save_results(results, "gsm8k")
solver.cleanup()

# Step 2: Judge the solutions
judge = JudgeModel("qwen3-4b")
judgements = judge.evaluate_batch(results, max_tokens=10)
judge.save_results(judgements, "qwen2.5-32b", "gsm8k", 10)
judge.cleanup()

# Step 3: Evaluate judge accuracy
evaluator = JudgeEvaluator("qwen3-4b", "qwen2.5-32b", "gsm8k", 10, judgements)
summary = evaluator.evaluate()

🧩 Advanced: Multi-Agent Strategies

# Majority voting ensemble
from slmjury.strategies.ensemble import run_majority_voting
run_majority_voting(
    judge_keys=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
    student_results=results,
    max_tokens=10,
)

# Multi-agent debate (3 judges, RCR prompting)
from slmjury.strategies.debate import run_debate
run_debate(
    combo_models=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
    combo_temps=[0, 0, 0],
    student_results=results,
    dataset_name="gsm8k",
)

# Persona effects (6 system prompts, shown here for a single judge)
from slmjury.strategies.persona import run_persona_evaluation
run_persona_evaluation("qwen3-4b", results, max_tokens=10)

🔬 Open-Ended Scoring (SummEval / MT-Bench)

# Score SummEval with a single judge
python scripts/run_scoring_judge.py \
  --judge qwen3-4b --dataset summeval

# Score MT-Bench with a single judge
python scripts/run_scoring_judge.py \
  --judge qwen3-4b --dataset mtbench \
  --oracle-scores results/mtbench_oracle/

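# Python API
from slmjury.core.scoring_judge import ScoringJudge

judge = ScoringJudge("qwen3-4b", output_dir="results/scoring")

# Score SummEval (4-dimension scoring)
# NOTE: load_dataset is not imported in this snippet; it presumably comes from
# the dataset loaders under slmjury/data
summeval_data = load_dataset("summeval")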
results = judge.score_summeval(summeval_data, max_tokens=8192)
judge.save_results(results, "summeval")
judge.cleanup()

🤖 Supported Models

Family	Models	Parameters	Thinking
Qwen 2.5	1.5B, 3B, 7B	1.5B – 7B	—
Qwen 3	0.6B, 1.7B, 4B, 8B, 14B	0.6B – 14B	✅
Llama 3.x	3.2-1B, 3.2-3B, 3.1-8B	1B – 8B	—
Phi-4	14B, Reasoning, R-Plus, Mini, Mini-Reasoning	3.8B – 14B	✅*

*Phi-4 Reasoning, Reasoning-Plus (R-Plus), and Mini-Reasoning always use thinking mode and skip the quick-verdict (max tokens = 10) evaluation.
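
One way to express the rule in the footnote is a small helper that picks the token budgets to run for each judge. The budgets 10 and 8192 are taken from the CLI example in Quick Start; the model keys in `ALWAYS_THINKING` and the function name `token_budgets` are placeholders, since the real keys live in the project's YAML configs and may differ.

```python
QUICK_VERDICT_TOKENS = 10   # short budget: verdict only
REASONED_TOKENS = 8192      # long budget: reasoning plus verdict

# Placeholder keys for the always-thinking Phi-4 variants (assumed names).
ALWAYS_THINKING = {"phi4-reasoning", "phi4-reasoning-plus", "phi4-mini-reasoning"}


def token_budgets(judge_key):
    """Return the max-token settings to evaluate for a given judge."""
    if judge_key in ALWAYS_THINKING:
        # These models cannot answer within 10 tokens, so skip the quick verdict.
        return [REASONED_TOKENS]
    return [QUICK_VERDICT_TOKENS, REASONED_TOKENS]
```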

📊 Datasets

Closed-ended (verdict: Correct/Incorrect):

Dataset	Type	Domain	Size
GSM8K	Numeric	Math	1,319
GSM-Plus	Numeric	Math	10,552
MATH	LaTeX	Math	5,000
ARC-Easy	Multiple Choice	Science	2,376
ARC-Challenge	Multiple Choice	Science	1,172
HellaSwag	Multiple Choice	General	10,042
WinoGrande	Multiple Choice	General	1,267
TruthfulQA	Multiple Choice	General	684
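
A judge's Correct/Incorrect verdict is only meaningful relative to whether the student's answer actually matches the reference. The sketch below shows how that ground-truth label could be derived for the Numeric and Multiple Choice types in the table; LaTeX answers are left out. All function names are hypothetical and the normalization rules are deliberately naive; the project's own parsers are likely more thorough.

```python
import re


def normalize_numeric(answer):
    """Parse a numeric answer such as '$1,234.50' into a float, or None."""
    cleaned = re.sub(r"[,$%\s]", "", answer)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_choice(answer):
    """Reduce a bare choice label such as '(b)' or 'B.' to 'B', or None.

    Assumes the string holds only the label, not a full sentence.
    """
    match = re.search(r"[A-Za-z]", answer)
    return match.group(0).upper() if match else None


def is_correct(student, gold, answer_type):
    """Return True if the student's answer matches the reference answer."""
    if answer_type == "numeric":
        s, g = normalize_numeric(student), normalize_numeric(gold)
        return s is not None and g is not None and abs(s - g) < 1e-6
    if answer_type == "choice":
        s = normalize_choice(student)
        return s is not None and s == normalize_choice(gold)
    raise ValueError(f"unsupported answer type: {answer_type}")
```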

Open-ended (scoring: 1–5):

Dataset	Type	Turns	Size	Oracle
SummEval	Summarization	—	1,600 pairs	Human annotations
MT-Bench	Multi-turn chat	2	80 questions	GPT-OSS-120B, Qwen3.5-397B (Together API)
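
For the open-ended datasets, agreement with the oracle is measured by correlation rather than accuracy. The source does not say which coefficient SLMJury reports, so the sketch below uses Spearman rank correlation as one common choice, implemented with the standard library only. The function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, oracle_scores):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(judge_scores), average_ranks(oracle_scores))
```

Rank correlation suits 1–5 scores because it rewards a judge for ordering items the way the oracle does, even if its absolute scores are shifted.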

🏗️ Project Structure

SLMJury/
├── slmjury/                  # Python package
│   ├── configs/              # Centralized YAML model configurations
│   ├── data/                 # Dataset loaders (HuggingFace → local JSON)
│   ├── parsers/              # Answer extraction, normalization, verdict/score parsing
│   ├── core/                 # Pipeline: solver → judge → evaluator + scoring
│   └── strategies/           # Ensemble voting, multi-agent debate, personas
├── scripts/                  # CLI entry-points (student, judge, oracle, scoring)
├── bash/                     # Bash wrappers for full experiment runs
├── tests/                    # Unit & integration tests (pytest)
├── website/                  # React leaderboard (Vite + Tailwind)
├── assets/                   # SVG banner and logo
├── pyproject.toml            # Package config (pip install -e .)
└── README.md

🏆 Leaderboard

Explore full results on the interactive leaderboard.

📖 Citation

If you use SLMJury in your research, please cite:

@misc{laddha2026slmjury,
      title={SLMJury: Can Small Language Models Judge as Well as Large Language Models?},
      author={Anish Laddha and Nitesh Pradhan and Gaurav Srivastava},
      year={2026},
}

📄 License

Apache License 2.0 — see LICENSE for details.

Made with ❤️ by Anish Laddha, Nitesh Pradhan, and Gaurav Srivastava

Version 0.1.0, released May 26, 2026.

