Noesis Agentic Software Development Evals Toolkit
Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.
Why NASDE?
Your team runs AI coding agents — but which setup is actually best for your codebase, and what is it costing you? Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.
The decisions that matter are getting expensive: which provider, which model, which configuration — each with a different quality-per-dollar trade-off. NASDE measures your whole harness — the agent, its skills, its MCP servers, against your tasks — and reports not just how good the output is, but how many tokens and how many dollars it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
It runs on your own machine with a subscription you already have. Today it drives Claude Code, the Codex CLI, and the Gemini CLI; planned: Pi, Cursor, and router-based setups.
What NASDE does — in four steps
One nasde run command executes the whole chain.
- You describe a task you already understand. An instruction, a repo snapshot, and the assessment criteria describing what a good solution looks like. The output can be anything the agent writes into its workspace — code, a migration plan, an ADR, a SQL script, updated docs.
- The agent solves it in a sandbox. The agent works in a safe, isolated environment; it can't touch your machine or your real code. Every run starts from the same clean state, so different configurations get a fair comparison. When it's done, a quick test.sh check gives a rough pass/fail signal. Powered by Harbor, runs locally on Docker or in the cloud.
- A reviewer agent assesses the result against your criteria. After initial rough tests pass or fail, a second coding agent (claude or codex) navigates the workspace and scores your chosen dimensions (e.g. domain modeling, test quality) on whatever scale you picked. The review stays token-efficient even on large codebases.
- Results land in a dashboard (optional). Browse scores, compare variants, and track how your agent setup evolves over time, optionally via Opik.
You're the one defining "what good looks like." NASDE just automates running the experiment and assessing it the same way every time.
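To make the reviewer step more concrete, here is a minimal sketch of how per-dimension scores given on different scales could be rescaled and combined into a single number. It is an illustration only: the DimensionScore class, the weights, and the example values are hypothetical, and nothing here reflects NASDE's actual result format or scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One reviewer judgment (hypothetical shape, not NASDE's schema)."""
    name: str
    score: float          # value the reviewer assigned
    max_score: float      # top of the scale you picked; scale assumed to start at 0
    weight: float = 1.0   # relative importance, chosen by you


def normalized_total(dimensions: list[DimensionScore]) -> float:
    """Weighted mean of the dimension scores, each rescaled to the 0..1 range."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("at least one dimension needs a positive weight")
    weighted = sum(d.weight * (d.score / d.max_score) for d in dimensions)
    return weighted / total_weight


# Made-up review: two dimensions scored on different scales.
review = [
    DimensionScore("domain modeling", score=4, max_score=5, weight=2.0),
    DimensionScore("test quality", score=6, max_score=10),
]
print(f"{normalized_total(review):.3f}")  # prints 0.733
```

Rescaling each dimension before averaging keeps a 1-to-10 dimension from outweighing a 1-to-5 one just because its numbers are larger.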
📖 Documentation
Full documentation lives at → noesisvision.github.io/nasde-toolkit
Concepts (how the scoring works, the evaluation pipeline, token & cost, rubric calibration), the complete CLI reference, every configuration-file format, authentication, and step-by-step guides — all there, searchable.
- Quick Start — zero to a working benchmark from your own git history
- How it works — the two independent kinds of scoring, end to end
- A real task, end to end — instruction, criteria, and scores
- CLI reference — the full command reference
- Use Cases · Benchmark Results — worked examples with numbers
What do I use it for?
The core use is a cost-and-quality decision about your AI coding stack: which agent, which model, which provider, which configuration — for our codebase and our budget? NASDE answers it with numbers instead of vibes. Typical things you'd do with it:
- Compare providers and models on quality and cost — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against your tasks; see the score and the tokens and dollars each one spends, and pick the best quality-per-dollar for your budget.
- Decide whether a migration pays off — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
- Measure your whole harness, not just one skill: run your real CLAUDE.md + skills + MCP servers as a unit and see how the full configuration performs.
- Tune a single skill or config: baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
- Build a regression suite for your AI setup — re-run the task set whenever someone tweaks the prompt/skills/MCP/model and catch quality or cost regressions before they ship.
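The comparisons in the list above come down to simple arithmetic once you have a score and a cost per variant. The sketch below ranks variants by score per dollar and flags quality or cost regressions against a baseline. Everything in it is assumed for illustration: the VariantResult class, the thresholds, and the numbers are made up and are not NASDE output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantResult:
    """Summary of one configuration (hypothetical shape, made-up fields)."""
    name: str
    score: float     # overall quality, assumed to be on a 0..1 scale
    cost_usd: float  # total spend for the task set
    tokens: int      # total tokens used, kept for reporting


def score_per_dollar(result: VariantResult) -> float:
    """Quality bought per dollar spent."""
    if result.cost_usd <= 0:
        raise ValueError(f"{result.name}: cost must be positive")
    return result.score / result.cost_usd


def rank_by_value(results: list[VariantResult]) -> list[VariantResult]:
    """Best quality-per-dollar first."""
    return sorted(results, key=score_per_dollar, reverse=True)


def regressions(
    baseline: VariantResult,
    candidate: VariantResult,
    max_score_drop: float = 0.05,     # assumed tolerance, tune to taste
    max_cost_increase: float = 0.20,  # assumed tolerance: 20% more spend
) -> list[str]:
    """Return which of 'quality' and 'cost' got worse beyond the tolerance."""
    problems = []
    if baseline.score - candidate.score > max_score_drop:
        problems.append("quality")
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        problems.append("cost")
    return problems


# Made-up numbers for three variants of the same task set.
baseline = VariantResult("baseline", score=0.70, cost_usd=2.00, tokens=400_000)
with_skill = VariantResult("with-skill", score=0.80, cost_usd=2.50, tokens=520_000)
cheap_model = VariantResult("cheap-model", score=0.60, cost_usd=1.00, tokens=380_000)

for r in rank_by_value([baseline, with_skill, cheap_model]):
    print(f"{r.name}: {score_per_dollar(r):.2f} score per dollar")
print(regressions(baseline, with_skill))   # ['cost']
print(regressions(baseline, cheap_model))  # ['quality']
```

Score per dollar alone favors the cheapest variant here even though its quality dropped, which is why the sketch pairs the ranking with an explicit regression check.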
Quick start
The fastest path from zero to a working benchmark built from your own git history:
# 1. Install the CLI
uv tool install nasde-toolkit --python 3.13
nasde --version
# 2. Install the authoring skills for Claude Code
nasde install-skills
Python version: we recommend --python 3.13 (3.12 is also supported). Python 3.14 is not yet supported because a transitive dependency hasn't released cp314 wheels.
Then, from inside your own repo, ask Claude Code:
"Create a NASDE benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR."
The nasde-benchmark-from-history skill proposes a good candidate and scaffolds the task files for you to review. Then run it:
nasde run --all-variants -C path/to/generated-benchmark
Start small — one task is enough to validate the loop end to end. Your existing claude / codex / gemini CLI auth covers it (a Claude Max or ChatGPT Plus subscription is enough). API keys work too.
→ Full walkthrough: Quick Start · Authentication & Opik
Authoring helpers (Claude Code skills)
Writing assessment_criteria.md, picking tasks from git history, and scaffolding Dockerfiles is the tedious part of building a benchmark. NASDE ships Claude Code skills that take care of most of it — install them with nasde install-skills:
| Skill | What it does |
|---|---|
| nasde-benchmark-creator | Interactive end-to-end scaffolding: project layout, tasks, Dockerfiles, test scripts, assessment criteria. |
| nasde-benchmark-from-history | Point it at a commit range, a merged PR, or a closed issue from your own repo — it proposes tasks based on work your team already finished, and writes the task files for you to review. |
| nasde-benchmark-from-public-repos | Describe a skill you want to test broadly; it builds a diversity matrix of public repos (languages, sizes, styles) and scaffolds one task per cell. |
| nasde-benchmark-runner | Guides running benchmarks, re-running the reviewer on existing results, verifying the experiment tracker, and troubleshooting failed runs. |
| nasde-benchmark-calibration | Publishes trial diffs + scores as PRs/MRs, pulls your review comments back, and proposes concrete rubric edits — the human-in-the-loop calibration loop. |
You don't have to use these — everything they do is just writing files you could write by hand — but they save a lot of typing.
Architecture
See ARCHITECTURE.md for the full system architecture with diagrams, and docs/adr/ for architectural decision records. Release notes live in CHANGELOG.md.
Key design: nasde is a thin integration layer over Harbor and Opik, not a replacement. The core flow uses their Python APIs directly; utility commands pass through to their CLIs unchanged.
Community
Have questions, want to share your benchmarks, or discuss AI agent evaluation strategies? Join our Discord community — we'd love to hear from you!
Security
Found a security issue? Please report it privately — see SECURITY.md for the reporting channels, response timeline, and what's in scope.