
Benchmark assessment approaches: pure-LLM marking vs the family's signal-based observations, with repeated runs and agreement statistics.


assessment-bench

Part of the lens family.

Requires Python 3.11+. License: MIT.

Benchmark assessment approaches. Run one cohort through competing assessment arms — pure-LLM marking (the baseline) and the family's signal-based observations (assessment-lens) — with repeated runs, consistency statistics, and agreement against human marks. The bench measures; it never marks.

assessment-bench is a bench (a measurement product), not an analyser and not a marking tool. It exists to answer research questions such as: how consistent is LLM marking across repeated runs and providers? Which deterministic signals actually track human judgement?

What it does

experiment.yaml (rubric + cohort + arms)
  ├─ llm arm(s)    : submission + rubric → provider → score   × repetitions
  ├─ signals arm   : assessment-lens → evidence values        (deterministic, once)
  └─ human marks   : optional ground-truth CSV
        ↓
result.json + runs.csv + signals.csv + agreement.csv
  • per-submission consistency: mean / median / std-dev / CV / reliability
  • agreement: Pearson & Spearman of every arm mean and every numeric signal
    against the human marks

Install

# from source (family layout)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

# the signals arm needs the analyser stack (bundle-analyser CLI on PATH):
uv pip install -e ".[analysers]"

# LLM arms (Anthropic, OpenAI, Ollama, OpenRouter):
uv pip install -e ".[llm]"      # + export ANTHROPIC_API_KEY / OPENAI_API_KEY / ...

Quick start

assessment-bench init experiment.yaml   # commented example config
# edit: point at your rubric.yaml + submissions/, choose arms
assessment-bench run experiment.yaml -o out/

LLM arms specify provider and model per arm — comparing claude-haiku-4-5 vs gpt-4o-mini vs a local llama3.1 via Ollama is just three arms in one config.

Relationship to the family

  • Analysers generate deterministic signals (assessment-agnostic).
  • assessment-lens maps signals to a rubric as observations — never scores.
  • assessment-bench measures both approaches against human judgement. The LLM arm produces scores because that is the approach under test; the bench treats them as data points, not grades for students.

Status

v0.1 scaffold. Working today:

  • ✅ Experiment config (YAML) → cohort discovery → arms → structured results
  • ✅ LLM arm: multi-provider (anthropic / openai / ollama / openrouter), repeated runs, strict SCORE: x/y extraction with scaled fallback
  • ✅ Signals arm: one assessment-lens pass; raw evidence values consumed (not the presence-based coverage)
  • ✅ Consistency stats (ported from the original Rust prototype) + Pearson/Spearman agreement vs human marks
  • 📋 Hybrid arm (LLM marking with analyser signals in context) — next
  • 📋 HTTP service + desktop shell for non-technical researchers — planned

Development

pytest -v

License

MIT — see LICENSE.


