22 projects
prompt-rubric-drift
Deterministic prompt and rubric drift detection for PR review.
lerobot-quality-gates
Training-readiness checks for LeRobot-style robotics datasets.
otel-eval-bridge
Bridge OpenTelemetry/Phoenix GenAI traces to eval cases and manifests.
failure-gallery
Synthetic gallery of reproducible agent and robotics failure cases.
vla-robustness-kit
Simulator-light diagnostics for vision-language-action (VLA) policy robustness.
embodiment-card
Structured embodiment cards for robot datasets and VLA releases.
robot-recovery-bench
Review metrics and schema for robot intervention and recovery segments.
agent-trace-card
Portable cards for reviewable agent traces and failures.
mcp-risk-linter
Readiness linter for Model Context Protocol (MCP) server tool, auth, filesystem, shell, network, and documentation risks.
tool-call-replay
Deterministic replay harness for agent tool-call traces.
a2a-contract-test
Offline contract tests for A2A-style agent cards and task lifecycle behavior.
eval-conformance-suite
Executable rubric-spec v1 conformance suite.
eval-adapter
Unified rubric and run adapter for common eval frameworks.
eval-run-manifest
Signed manifest envelope for eval runs.
contamination-audit
N-gram, embedding, canary, answer-pattern, and corpus contamination auditor.
synthetic-disagreement
Controlled synthetic annotator disagreement generator.
judge-bench
Diagnostic probes for LLM-as-judge reliability.
judge-card
Judge Card schema, validator, and renderer.
auraone-evalkit
Local open-source evaluation tooling for rubric validation, linting, and deterministic scoring.
iaa-kit
Modern inter-annotator agreement metrics with bootstrap confidence intervals.
rubric-spec
Portable AuraOne Rubric Schema v1 validator and adapters.
auraone-sdk
Official Python SDK and CLI for the AuraOne hosted API.