NeuroDock eval corpora and harness — versioned datasets for translation, skills, and guardrails.
neurodock-evals
The versioned eval corpora and the air-gapped harness that runs ND prompts against them.
The corpus is the strategic asset that makes the translation layer honest. We prove that ND-aware prompts help neurodivergent users in real situations, and we catch regressions when prompts change. The harness gates prompt PRs in CI.
This package is v0.0.1 — the scaffold, the harness, and 6-10 hand-authored seed examples. The seeds are synthesised by to demonstrate the format — they are NOT real corporate messages. Real contributed corpora arrive over Phase 2 (target ~300 examples by month 6, per ).
What's here
packages/evals/
├── src/neurodock_evals/ # Harness, anonymiser, deduper, scorer
├── corpora/ # Versioned YAML eval examples by slice
├── schemas/ # JSON Schemas for examples + annotations
└── tests/ # Tests for the harness itself
Quick start
Run the harness against the seed corpora:
uv run python -m neurodock_evals.harness --corpus translation/incoming \
--tool translate_incoming
Run all four translation slices:
uv run python -m neurodock_evals.harness --ci
Anonymise a contribution before opening a PR:
uv run python -m neurodock_evals.anonymise path/to/example.yaml
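To give a feel for what an automated anonymisation pass can and cannot catch, here is a minimal sketch that masks URLs and email addresses with regular expressions. It is not the logic in `anonymise.py`; the function name `scrub`, the placeholder tokens, and the patterns are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real anonymiser may cover far more (names, IDs, etc.).
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(text: str) -> str:
    """Replace URLs and email addresses with fixed placeholder tokens."""
    # URLs go first so that credentials embedded in a URL are masked with it.
    text = URL.sub("[URL]", text)
    text = EMAIL.sub("[EMAIL]", text)
    return text


print(scrub("Mail jo@example.com or see https://x.example/a"))
```

Pattern matching of this kind misses anything it was not written for, such as a project codename or a colleague's first name, so a contributor still has to read the example before submitting it.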
Air-gapped by design
The harness never calls an LLM. It exercises each tool's deterministic
baseline (the heuristic layer the translation server returns even before any
LLM refinement) and scores the baseline against the human-rated expected
block. Any LLM-side eval is a separate concern that the maintainer reviews
under a different policy.
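As a rough illustration of what scoring a deterministic baseline can look like, the sketch below pairs a toy rule-based rewrite with a token-overlap score. Every name in it (`baseline_translate`, `overlap_score`, the phrase table) is hypothetical and not taken from `neurodock_evals`; the real harness may score quite differently.

```python
import re

# Hypothetical phrase table standing in for a tool's heuristic layer.
PHRASES = {
    "asap": "as soon as you can; no exact deadline was given",
    "circle back": "discuss this again later",
}


def baseline_translate(message: str) -> str:
    """Deterministic rewrite: the same input always gives the same output, no LLM call."""
    out = message.lower()
    for phrase, plain in PHRASES.items():
        out = out.replace(phrase, plain)
    return out


def tokens(text: str) -> set[str]:
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_score(actual: str, expected: str) -> float:
    """Jaccard overlap between the two token sets, in the range [0, 1]."""
    a, b = tokens(actual), tokens(expected)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


actual = baseline_translate("Let's circle back")
print(overlap_score(actual, "let's discuss this again later"))
```

Because neither function touches the network or a model, a run like this is repeatable on a machine with no outbound access, which is the property the air-gapped design relies on.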
Privacy
- The harness never logs example contents to stdout or to anywhere outside .eval-reports/.
- Reports contain example IDs and scores only, never verbatim text.
- The contribution pipeline (anonymise.py) is a safety net, NOT a substitute for contributor judgement. See CONTRIBUTING.md.
- All corpora are licensed AGPL-3.0-or-later.
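One way to keep verbatim text out of reports is to make the report writer accept only IDs and scores, so example contents never reach it. The sketch below assumes a JSON report and a hypothetical `write_report` helper; the actual report format and file names under .eval-reports/ are not specified here.

```python
import json
from pathlib import Path


def write_report(scores: dict[str, float], report_dir: Path, name: str = "report.json") -> Path:
    """Write example IDs and scores only; example text is never passed in."""
    report_dir.mkdir(parents=True, exist_ok=True)
    path = report_dir / name
    # Sort by ID so repeated runs produce identical files.
    rows = [{"id": ex_id, "score": round(score, 4)} for ex_id, score in sorted(scores.items())]
    path.write_text(json.dumps({"results": rows}, indent=2), encoding="utf-8")
    return path
```

Calling `write_report({"incoming-001": 0.82}, Path(".eval-reports"))` would produce a file containing one ID and one score. The ID in that call is made up for the example.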
Glossary
| Term | Meaning |
|---|---|
| corpus slice | a directory under corpora/<server>/<slice>/; the unit of versioning |
| example | one YAML file under a slice — one input, one expected block, multiple ratings |
| rating | one ND-rater's judgement of how closely the expected block matches their read |
| deterministic baseline | the heuristic output a translation tool returns without invoking an LLM |
| eval-corpus binding | every mcp-translation tool cites the slice that validates it (ADR 0005 §4) |
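The relationship between an example, its expected block, and its ratings can be pictured as a small data model. The sketch below is an assumption for illustration only: the field names, the 0.0 to 1.0 agreement scale, and the `mean_agreement` helper are hypothetical, and the authoritative shape is whatever the JSON Schemas under schemas/ define.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    rater_id: str
    agreement: float  # assumed scale: 0.0 (does not match my read) to 1.0 (matches)


@dataclass(frozen=True)
class Example:
    example_id: str
    input_text: str
    expected: dict[str, str]  # the human-rated expected block
    ratings: tuple[Rating, ...] = ()

    def mean_agreement(self) -> float | None:
        """Average rater agreement, or None when nobody has rated the example yet."""
        if not self.ratings:
            return None
        return mean(r.agreement for r in self.ratings)


example = Example(
    example_id="demo-001",  # made-up ID
    input_text="Can we circle back on this?",
    expected={"plain": "They want to discuss this again later."},
    ratings=(Rating("r1", 1.0), Rating("r2", 0.5)),
)
print(example.mean_agreement())
```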
Status
- v0.0.1 (current): scaffold + harness + 10 synthesised seed examples
- v0.0.2 (planned): first contributed corpus (after )
- v0.1.0 (planned): HuggingFace publication pipeline under the neurodock org
See CHANGELOG.md for detail.
Source Distribution
File details
Details for the file neurodock_evals-0.0.1.tar.gz.
File metadata
- Download URL: neurodock_evals-0.0.1.tar.gz
- Upload date:
- Size: 25.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b050315d30d7ee44da08f5cb10256ed48c6e8e87d6fb84814ab1e8518152e359 |
| MD5 | f5388483ff8bfdcaf43af628373e2b02 |
| BLAKE2b-256 | ba1d1e494043202b0df518cf4fde482536130057f7e8fdfc1943025d71db058c |