Benchmark assessment approaches: pure-LLM marking vs the family's signal-based observations, with repeated runs and agreement statistics.
assessment-bench
Part of the lens family.
Benchmark assessment approaches. Run one cohort through competing
assessment arms — pure-LLM marking (the baseline) and the family's
signal-based observations (assessment-lens) — with repeated runs,
consistency statistics, and agreement against human marks.
The bench measures; it never marks.
assessment-bench is a bench (a measurement product), not an -analyser and not a marking tool. It exists to answer research questions like: how consistent is LLM marking across repeated runs and providers? and which deterministic signals actually track human judgement?
What it does
experiment.yaml (rubric + cohort + arms)
├─ llm arm(s) : submission + rubric → provider → score × repetitions
├─ signals arm : assessment-lens → evidence values (deterministic, once)
└─ human marks : optional ground-truth CSV
↓
result.json + runs.csv + signals.csv + agreement.csv
• per-submission consistency: mean / median / std-dev / CV / reliability
• agreement: Pearson & Spearman of every arm mean and every numeric signal
against the human marks
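The per-submission consistency figures can be illustrated with a minimal sketch. This is not the bench's own code: the function name `consistency` is hypothetical, the sample standard deviation is an assumption (the bench may use the population form), and the reliability measure is left out because its definition is not given here.

```python
import statistics


def consistency(scores):
    """Summarise the repeated scores one arm gave to one submission.

    Hypothetical helper for illustration; not part of assessment-bench.
    """
    mean = statistics.fmean(scores)
    median = statistics.median(scores)
    # Assumption: sample standard deviation. It is undefined for a single
    # run, so a lone score is reported as having zero spread.
    std_dev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    # Coefficient of variation: spread relative to the mean.
    # Left as None when the mean is zero, where the ratio is undefined.
    cv = std_dev / mean if mean else None
    return {"mean": mean, "median": median, "std_dev": std_dev, "cv": cv}


if __name__ == "__main__":
    # Three repeated runs of the same submission through one LLM arm.
    print(consistency([7.0, 8.0, 9.0]))
```

A lower CV means the arm gave closer scores across repetitions, which makes submissions with different mark ranges comparable.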
Install
# from source (family layout)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# the signals arm needs the analyser stack (bundle-analyser CLI on PATH):
uv pip install -e ".[analysers]"
# LLM arms (Anthropic, OpenAI, Ollama, OpenRouter):
uv pip install -e ".[llm]" # + export ANTHROPIC_API_KEY / OPENAI_API_KEY / ...
Quick start
assessment-bench init experiment.yaml # commented example config
# edit: point at your rubric.yaml + submissions/, choose arms
assessment-bench run experiment.yaml -o out/
LLM arms specify provider and model per arm — comparing
claude-haiku-4-5 vs gpt-4o-mini vs a local llama3.1 via Ollama is just
three arms in one config.
Relationship to the family
- Analysers generate deterministic signals (assessment-agnostic).
- assessment-lens maps signals to a rubric as observations — never scores.
- assessment-bench measures both approaches against human judgement. The LLM arm produces scores because that is the approach under test; the bench treats them as data points, not grades for students.
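Measuring an arm against human judgement comes down to correlating two columns of numbers, one value per submission. The sketch below shows Pearson and Spearman written out by hand so that each step is visible. The function names are hypothetical, the data is made up, and the bench's own implementation may differ (for example in how it handles ties or missing marks).

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    human_marks = [55, 62, 70, 81]   # illustrative data only
    arm_means = [50, 65, 68, 90]     # mean score per submission for one arm
    print("pearson ", pearson(arm_means, human_marks))
    print("spearman", spearman(arm_means, human_marks))
```

Pearson rewards a linear relationship with the human marks, while Spearman only asks whether the arm ranks submissions in the same order, so reporting both separates scale problems from ordering problems.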
Status
v0.1 scaffold. Working today:
- ✅ Experiment config (YAML) → cohort discovery → arms → structured results
- ✅ LLM arm: multi-provider (anthropic / openai / ollama / openrouter), repeated runs, strict SCORE: x/y extraction with scaled fallback
- ✅ Signals arm: one assessment-lens pass; raw evidence values consumed (not the presence-based coverage)
- ✅ Consistency stats (ported from the original Rust prototype) + Pearson/Spearman agreement vs human marks
- 📋 Hybrid arm (LLM marking with analyser signals in context) — next
- 📋 HTTP service + desktop shell for non-technical researchers — planned
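The strict SCORE: x/y extraction mentioned for the LLM arm could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name `extract_score`, the exact pattern, and in particular the reading of "scaled fallback" as rescaling a score the model reported out of a different maximum. The bench's actual rules may differ.

```python
import re

# Strict: the score must sit alone on its own line as "SCORE: x/y".
SCORE_RE = re.compile(
    r"^\s*SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$",
    re.MULTILINE,
)


def extract_score(reply, rubric_max):
    """Return the score on the rubric's scale, or None if none is found.

    Hypothetical helper for illustration; not part of assessment-bench.
    """
    match = SCORE_RE.search(reply)
    if match is None:
        return None
    awarded = float(match.group(1))
    out_of = float(match.group(2))
    # Reject impossible values instead of guessing what the model meant.
    if out_of <= 0 or awarded > out_of:
        return None
    if out_of == rubric_max:
        return awarded
    # Assumed "scaled fallback": the model used a different maximum,
    # so rescale its score onto the rubric's scale.
    return awarded * rubric_max / out_of


if __name__ == "__main__":
    print(extract_score("Clear structure.\nSCORE: 7/10", rubric_max=10))
    print(extract_score("SCORE: 14/20", rubric_max=10))
    print(extract_score("I would give this about a 7.", rubric_max=10))
```

Returning None for an unparseable reply keeps a failed extraction visible as missing data instead of silently turning it into a score.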
Development
pytest -v
License
MIT — see LICENSE.