Benchmark assessment approaches: pure-LLM marking vs the family's signal-based observations, with repeated runs and agreement statistics.

assessment-bench

Part of the lens family.

Python 3.11+ | License: MIT

Benchmark assessment approaches. Run one cohort through competing assessment arms — pure-LLM marking (the baseline) and the family's signal-based observations (assessment-lens) — with repeated runs, consistency statistics, and agreement against human marks. The bench measures; it never marks.

assessment-bench is a bench (a measurement product), not an analyser and not a marking tool. It exists to answer research questions such as: how consistent is LLM marking across repeated runs and providers, and which deterministic signals actually track human judgement?

What it does

experiment.yaml (rubric + cohort + arms)
  ├─ llm arm(s)    : submission + rubric → provider → score   × repetitions
  ├─ signals arm   : assessment-lens → evidence values        (deterministic, once)
  └─ human marks   : optional ground-truth CSV
        ↓
result.json + runs.csv + signals.csv + agreement.csv
  • per-submission consistency: mean / median / std-dev / CV / reliability
  • agreement: Pearson & Spearman of every arm mean and every numeric signal
    against the human marks
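
The per-submission consistency figures listed above can be illustrated with a minimal sketch. It is not the bench's own code: the function name `consistency`, the use of the sample standard deviation, and the handling of a zero mean are all assumptions made for illustration. The "reliability" figure is left out because its definition is not given here.

```python
import statistics


def consistency(scores):
    """Summarise the repeated scores one arm gave to one submission.

    Hypothetical helper: the bench's real function names and exact
    conventions (sample vs population std-dev) are not documented here.
    """
    mean = statistics.fmean(scores)
    median = statistics.median(scores)
    # The sample standard deviation needs at least two runs.
    std_dev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    # The coefficient of variation is undefined when the mean is zero.
    cv = std_dev / mean if mean != 0 else None
    return {"mean": mean, "median": median, "std_dev": std_dev, "cv": cv}


# Four repetitions of the same submission through one LLM arm (made-up values).
runs = [14.0, 15.0, 13.0, 14.0]
summary = consistency(runs)
print(summary)
```

A low CV means the arm gives nearly the same score each time it sees the same submission. It says nothing about whether that score agrees with the human mark, which is why agreement is reported separately.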

Install

# from source (family layout)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

# the signals arm needs the analyser stack (bundle-analyser CLI on PATH):
uv pip install -e ".[analysers]"

# LLM arms (Anthropic, OpenAI, Ollama, OpenRouter):
uv pip install -e ".[llm]"      # + export ANTHROPIC_API_KEY / OPENAI_API_KEY / ...

Quick start

assessment-bench init experiment.yaml   # commented example config
# edit: point at your rubric.yaml + submissions/, choose arms
assessment-bench run experiment.yaml -o out/

LLM arms specify provider and model per arm — comparing claude-haiku-4-5 vs gpt-4o-mini vs a local llama3.1 via Ollama is just three arms in one config.

Relationship to the family

  • Analysers generate deterministic signals (assessment-agnostic).
  • assessment-lens maps signals to a rubric as observations — never scores.
  • assessment-bench measures both approaches against human judgement. The LLM arm produces scores because that is the approach under test; the bench treats them as data points, not grades for students.
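
To make "measures against human judgement" concrete, here is a small standard-library sketch of the two agreement statistics the bench reports, Pearson and Spearman. The function names and the sample marks are illustrative assumptions, not the bench's API. Spearman is computed as Pearson on average ranks, which is the usual definition when ties are present.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: human marks and one arm's mean score per submission.
human = [10, 12, 15, 18, 20]
arm_mean = [11, 12, 14, 19, 30]
r = pearson(human, arm_mean)
rho = spearman(human, arm_mean)
print(f"pearson={r:.3f} spearman={rho:.3f}")
```

In this example the arm orders the submissions exactly as the human marker does, so Spearman is 1.0, while Pearson is lower because the last score is inflated. Reporting both separates "ranks students the same way" from "lands on the same scale".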

Status

v0.1 scaffold. Working today:

  • ✅ Experiment config (YAML) → cohort discovery → arms → structured results
  • ✅ LLM arm: multi-provider (anthropic / openai / ollama / openrouter), repeated runs, strict SCORE: x/y extraction with scaled fallback
  • ✅ Signals arm: one assessment-lens pass; raw evidence values consumed (not the presence-based coverage)
  • ✅ Consistency stats (ported from the original Rust prototype) + Pearson/Spearman agreement vs human marks
  • 📋 Hybrid arm (LLM marking with analyser signals in context) — next
  • 📋 HTTP service + desktop shell for non-technical researchers — planned
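
The "strict SCORE: x/y extraction with scaled fallback" item in the list above could look roughly like the sketch below. This is one plausible reading, not the bench's actual parser: the regular expression, the function name `extract_score`, and the interpretation of "scaled fallback" (rescaling when the model reports a different denominator from the rubric maximum) are all assumptions.

```python
import re

# Strict form: a line consisting only of "SCORE: x/y" (assumed format).
SCORE_RE = re.compile(
    r"^SCORE:[ \t]*(\d+(?:\.\d+)?)[ \t]*/[ \t]*(\d+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def extract_score(reply, max_score):
    """Return the score on the rubric's scale, or None if nothing parses.

    Hypothetical helper: free-text guesses such as "about 14" are rejected
    rather than interpreted.
    """
    match = SCORE_RE.search(reply)
    if match is None:
        return None
    score, out_of = float(match.group(1)), float(match.group(2))
    if out_of <= 0 or score > out_of:
        return None  # malformed, e.g. "SCORE: 25/20"
    if out_of == max_score:
        return score
    # Assumed "scaled fallback": the model used another denominator.
    return score * max_score / out_of


print(extract_score("Clear argument, weak referencing.\nSCORE: 14/20", 20))
print(extract_score("SCORE: 7/10", 20))
print(extract_score("I would give this about 14.", 20))
```

Returning None instead of guessing keeps unparseable replies visible as missing data points, so they do not quietly distort the consistency statistics.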

Development

pytest -v

License

MIT — see LICENSE.
