Last released Jun 30, 2026
Measure LLM-judge verdict drift across model versions by re-grading a stored Inspect eval log with two graders over the same samples.
Last released Jun 25, 2026
A claim-support / faithfulness scorer for Inspect AI: does the transcript actually substantiate the claimed answer?