Benchmark assessment approaches: pure-LLM marking vs the family's signal-based observations, with repeated runs and agreement statistics.

assessment-bench

Part of the lens family.

Python 3.11+ | License: MIT

Benchmark assessment approaches. Run one cohort through competing assessment arms: pure-LLM marking (the baseline) and the family's signal-based observations (assessment-lens). Each experiment supports repeated runs, consistency statistics, and agreement against human marks. The bench measures; it never marks.

assessment-bench is a bench (a measurement product), not an analyser and not a marking tool. It exists to answer research questions such as "How consistent is LLM marking across repeated runs and providers?" and "Which deterministic signals actually track human judgement?"

What it does

experiment.yaml (rubric + cohort + arms)
  ├─ llm arm(s)    : submission + rubric → provider → score             × repetitions
  ├─ hybrid arm(s) : submission + rubric + signals → provider → score   × repetitions
  ├─ signals arm   : assessment-lens → evidence values                  (deterministic, once)
  └─ human marks   : optional ground-truth CSV
        ↓
result.json + runs.csv + signals.csv + agreement.csv
  • per-submission consistency: mean / median / std-dev / CV / reliability
  • agreement: Pearson & Spearman of every arm mean and every numeric signal
    against the human marks
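
The per-submission consistency figures listed above can be illustrated with a short Python sketch. It is not the bench's implementation: the function name `consistency`, the use of the sample standard deviation, and the handling of a zero mean are assumptions made for illustration. The "reliability" figure is left out because its definition is not given here.

```python
import statistics


def consistency(scores: list[float]) -> dict[str, float]:
    """Summarise the repeated scores one arm gave one submission (hypothetical helper)."""
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation; it needs at least two runs.
    std_dev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    # Coefficient of variation: spread relative to the mean, undefined for a zero mean.
    cv = std_dev / mean if mean else float("nan")
    return {
        "mean": mean,
        "median": statistics.median(scores),
        "std_dev": std_dev,
        "cv": cv,
    }


# Four repetitions of the same arm on one submission (made-up numbers).
runs = [14.0, 15.0, 13.0, 14.0]
print(consistency(runs))
```

A lower CV means the arm gives more similar scores when the same submission is marked repeatedly.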

Install

# from source (family layout)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

# the signals arm needs the analyser stack (bundle-analyser CLI on PATH):
uv pip install -e ".[analysers]"

# LLM arms (Anthropic, OpenAI, Ollama, OpenRouter):
uv pip install -e ".[llm]"      # + export ANTHROPIC_API_KEY / OPENAI_API_KEY / ...

Quick start

assessment-bench init experiment.yaml   # commented example config
# edit: point at your rubric.yaml + submissions/, choose arms
assessment-bench run experiment.yaml -o out/

Each LLM arm specifies its own provider and model, so comparing claude-haiku-4-5, gpt-4o-mini, and a local llama3.1 via Ollama takes just three arms in one config.

Relationship to the family

  • Analysers generate deterministic signals (assessment-agnostic).
  • assessment-lens maps signals to a rubric as observations — never scores.
  • assessment-bench measures both approaches against human judgement. The LLM arm produces scores because that is the approach under test; the bench treats them as data points, not grades for students.

Status

v0.3.0 (on PyPI). Working today:

  • ✅ Experiment config (YAML) → cohort discovery → arms → structured results
  • ✅ LLM arm: multi-provider (anthropic / openai / ollama / openrouter), repeated runs, strict SCORE: x/y extraction with scaled fallback
  • ✅ Signals arm: one assessment-lens pass; raw evidence values consumed (not the presence-based coverage)
  • ✅ Consistency stats (ported from the original Rust prototype) + Pearson/Spearman agreement vs human marks
  • ✅ Hybrid arm — LLM marking with the deterministic signals in context (one assessment-lens pass per cohort, shared across signals/hybrid arms)
  • ✅ HTTP API (assessment-bench serve, the [serve] extra) — health/manifest contract routes plus background experiment runs for UIs
  • 📋 Desktop shell for non-technical researchers — planned

Development

pytest -v

License

MIT — see LICENSE.
