Noesis Agentic Software Development Evals Toolkit

NASDE Toolkit

Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.

License: MIT


Why NASDE?

Your team runs AI coding agents — but which setup is actually best for your codebase, and what is it costing you? Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.

The decisions that matter are getting expensive: which provider, which model, which configuration — each with a different quality-per-dollar trade-off. NASDE measures your whole harness — the agent, its skills, its MCP servers, against your tasks — and reports not just how good the output is, but how many tokens and how many dollars it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
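As a rough illustration of the tokens-and-dollars accounting described above, the sketch below converts token counts into an estimated cost. The `Usage` class, the `estimate_cost` helper, and the per-million-token prices are hypothetical placeholders, not NASDE's API or any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one run (hypothetical structure, not NASDE's schema)."""
    input_tokens: int
    output_tokens: int


def estimate_cost(
    usage: Usage,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Return the estimated dollar cost given prices per million tokens."""
    input_cost = usage.input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = usage.output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices, chosen only to make the arithmetic easy to follow.
run = Usage(input_tokens=2_000_000, output_tokens=500_000)
cost = estimate_cost(run, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"estimated cost: ${cost:.2f}")
```

Input and output tokens are priced separately here because providers commonly charge different rates for each; a real comparison would substitute the rates that apply to your own plan.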

It runs on your own machine with a subscription you already have. Today it drives Claude Code, the Codex CLI, and the Gemini CLI; planned: Pi, Cursor, and router-based setups.

What NASDE does — in four steps

One nasde run command executes the whole chain.

  1. You describe a task you already understand. An instruction, a repo snapshot, and the assessment criteria describing what a good solution looks like. The output can be anything the agent writes into its workspace — code, a migration plan, an ADR, a SQL script, updated docs.
  2. The agent solves it in a sandbox. The agent works in a safe, isolated environment, so it can't touch your machine or your real code. Every run starts from the same clean state, so different configurations get a fair comparison. When it's done, a quick test.sh check gives a rough pass/fail signal. This step is powered by Harbor and runs locally on Docker or in the cloud.
  3. A reviewer agent assesses the result against your criteria. Whether the rough tests pass or fail, a second coding agent (claude or codex) then navigates the workspace and scores your chosen dimensions (e.g. domain modeling, test quality) on whatever scale you picked. The review stays token-efficient even on large codebases.
  4. Results land in a dashboard (optional). Browse scores, compare variants, and track how your agent setup evolves over time — optionally via Opik.

You're the one defining "what good looks like." NASDE just automates running the experiment and assessing it the same way every time.
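The four steps can be pictured as a small data flow. The sketch below is a toy model only: `Task`, `TrialResult`, and `run_trial` are hypothetical names, the agent, test script, and reviewer are stubs, and nothing here calls NASDE, Harbor, or Opik.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """Step 1: what the author supplies (hypothetical structure)."""
    instruction: str
    criteria: list[str]  # dimensions the reviewer should score


@dataclass
class TrialResult:
    """Outcome of one run: rough check plus reviewer scores per dimension."""
    passed_rough_check: bool
    scores: dict[str, float] = field(default_factory=dict)


def run_trial(
    task: Task,
    solve: Callable[[str], str],
    rough_check: Callable[[str], bool],
    review: Callable[[str, str], float],
) -> TrialResult:
    workspace = solve(task.instruction)  # step 2: agent produces its output
    passed = rough_check(workspace)      # step 2: quick pass/fail signal
    # Step 3: the reviewer scores every dimension whether or not the check passed.
    scores = {dim: review(workspace, dim) for dim in task.criteria}
    return TrialResult(passed, scores)   # step 4: record kept for comparison


# Stubs standing in for a real agent, test script, and reviewer agent.
task = Task("Add input validation", ["domain modeling", "test quality"])
result = run_trial(
    task,
    solve=lambda instruction: f"workspace for: {instruction}",
    rough_check=lambda workspace: "validation" in workspace,
    review=lambda workspace, dim: 4.0 if dim == "test quality" else 3.0,
)
print(result)
```

Passing the agent, check, and reviewer in as callables mirrors the idea that the same task can be replayed against different configurations while the criteria stay fixed.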

📖 Documentation

Full documentation lives at → noesisvision.github.io/nasde-toolkit

Concepts (how the scoring works, the evaluation pipeline, token & cost, rubric calibration), the complete CLI reference, every configuration-file format, authentication, and step-by-step guides — all there, searchable.

What do I use it for?

The core use is a cost-and-quality decision about your AI coding stack: which agent, model, provider, and configuration is right for your codebase and your budget? NASDE answers that question with numbers instead of vibes. Typical things you'd do with it:

  • Compare providers and models on quality and cost — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against your tasks; see the score and the tokens and dollars each one spends, and pick the best quality-per-dollar for your budget.
  • Decide whether a migration pays off — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
  • Measure your whole harness, not just one skill — run your real CLAUDE.md + skills + MCP servers as a unit and see how the full configuration performs.
  • Tune a single skill or config — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
  • Build a regression suite for your AI setup — re-run the task set whenever someone tweaks the prompt/skills/MCP/model and catch quality or cost regressions before they ship.
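A comparison like the ones listed above comes down to simple arithmetic once scores and costs are collected. The sketch below is illustrative only: the variant names, scores, and dollar figures are made-up numbers, and `quality_per_dollar` and `regressions` are hypothetical helpers rather than NASDE functions.

```python
from statistics import mean


def quality_per_dollar(scores: dict[str, float], cost_usd: float) -> float:
    """Mean dimension score divided by cost; one possible ranking metric."""
    return mean(scores.values()) / cost_usd


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return score deltas for dimensions where the candidate dropped by more than tolerance."""
    return {
        dim: candidate[dim] - baseline[dim]
        for dim in baseline
        if dim in candidate and candidate[dim] < baseline[dim] - tolerance
    }


# Made-up numbers for two variants of the same task set.
baseline = {"scores": {"domain modeling": 3.0, "test quality": 4.0}, "cost_usd": 2.0}
with_skill = {"scores": {"domain modeling": 4.0, "test quality": 3.5}, "cost_usd": 2.5}

for name, variant in (("baseline", baseline), ("with_skill", with_skill)):
    ratio = quality_per_dollar(variant["scores"], variant["cost_usd"])
    print(name, round(ratio, 2))

# Dimensions that got worse with the new skill, and by how much.
print(regressions(baseline["scores"], with_skill["scores"]))
```

In this invented example the new skill raises the average score but costs more and lowers one dimension, which is the kind of trade-off a single overall number would hide.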

Quick start

The fastest path from zero to a working benchmark built from your own git history:

# 1. Install the CLI
uv tool install nasde-toolkit --python 3.13
nasde --version

# 2. Install the authoring skills for Claude Code
nasde install-skills

Python version: we recommend --python 3.13 (3.12 is also supported). Python 3.14 is not yet supported — a transitive dependency hasn't released cp314 wheels.
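If you are unsure which interpreter a given environment uses, a quick check like the one below can confirm it falls in the range stated above. The `is_supported` helper is a hypothetical convenience for illustration, not part of NASDE.

```python
import sys

# Versions named in the note above: 3.13 recommended, 3.12 also supported.
SUPPORTED = {(3, 12), (3, 13)}


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if the (major, minor) pair is one the note lists as supported."""
    return version in SUPPORTED


current = (sys.version_info.major, sys.version_info.minor)
status = "supported" if is_supported(current) else "not listed as supported"
print(f"Python {current[0]}.{current[1]}: {status}")
```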

Then, from inside your own repo, ask Claude Code:

"Create a NASDE benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR."

The nasde-benchmark-from-history skill proposes a good candidate and scaffolds the task files for you to review. Then run it:

nasde run --all-variants -C path/to/generated-benchmark

Start small — one task is enough to validate the loop end to end. Your existing claude / codex / gemini CLI auth covers it (a Claude Max or ChatGPT Plus subscription is enough). API keys work too.

→ Full walkthrough: Quick Start · Authentication & Opik

Authoring helpers (Claude Code skills)

Writing assessment_criteria.md, picking tasks from git history, and scaffolding Dockerfiles is the tedious part of building a benchmark. NASDE ships Claude Code skills that take care of most of it — install them with nasde install-skills:

| Skill | What it does |
| --- | --- |
| nasde-benchmark-creator | Interactive end-to-end scaffolding: project layout, tasks, Dockerfiles, test scripts, assessment criteria. |
| nasde-benchmark-from-history | Point it at a commit range, a merged PR, or a closed issue from your own repo; it proposes tasks based on work your team already finished and writes the task files for you to review. |
| nasde-benchmark-from-public-repos | Describe a skill you want to test broadly; it builds a diversity matrix of public repos (languages, sizes, styles) and scaffolds one task per cell. |
| nasde-benchmark-runner | Guides running benchmarks, re-running the reviewer on existing results, verifying the experiment tracker, and troubleshooting failed runs. |
| nasde-benchmark-calibration | Publishes trial diffs + scores as PRs/MRs, pulls your review comments back, and proposes concrete rubric edits (the human-in-the-loop calibration loop). |

You don't have to use these — everything they do is just writing files you could write by hand — but they save a lot of typing.

Architecture

See ARCHITECTURE.md for the full system architecture with diagrams, and docs/adr/ for architectural decision records. Release notes live in CHANGELOG.md.

Key design: nasde is a thin integration layer over Harbor and Opik, not a replacement. Core flow uses their Python APIs directly; utility commands pass through to their CLIs unchanged.

Community

Have questions, want to share your benchmarks, or discuss AI agent evaluation strategies? Join our Discord community — we'd love to hear from you!

Security

Found a security issue? Please report it privately — see SECURITY.md for the reporting channels, response timeline, and what's in scope.
