nasde-toolkit

Noesis Agentic Software Development Evals Toolkit

These details have not been verified by PyPI

Project links

Homepage

Project description

Measure how your whole AI coding setup performs — and what it costs in tokens and dollars across models and providers — so you can choose where to invest and when to switch.

Why NASDE?

Your team runs AI coding agents — but which setup is actually best for your codebase, and what is it costing you? Switch from Claude to Codex, swap a model, add a skill or an MCP server — and you're guessing whether quality went up, down, or just got more expensive.

The decisions that matter are getting expensive: which provider, which model, which configuration — each with a different quality-per-dollar trade-off. NASDE measures your whole harness — the agent, its skills, its MCP servers, against your tasks — and reports not just how good the output is, but how many tokens and how many dollars it took, per model and per provider. That's the data behind a real decision: where to invest, which model to standardize on, and when a migration actually pays off.
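To make the quality-per-dollar idea concrete, here is a minimal sketch of the arithmetic. It does not use NASDE's API or output format: the `RunSummary` class, its field names, the setup labels, the prices, and every number are hypothetical placeholders chosen only to show the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """One benchmark run for one setup (hypothetical fields, not NASDE's schema)."""
    setup: str                # made-up label for an agent/model/provider combination
    score: float              # reviewer score on whatever scale you picked
    input_tokens: int
    output_tokens: int
    usd_per_mtok_in: float    # assumed price per million input tokens
    usd_per_mtok_out: float   # assumed price per million output tokens

    @property
    def cost_usd(self) -> float:
        """Token spend converted to dollars at the assumed prices."""
        return (self.input_tokens * self.usd_per_mtok_in
                + self.output_tokens * self.usd_per_mtok_out) / 1_000_000

    @property
    def score_per_dollar(self) -> float:
        return self.score / self.cost_usd


def rank_by_value(runs: list[RunSummary]) -> list[RunSummary]:
    """Sort setups from best to worst score per dollar."""
    return sorted(runs, key=lambda r: r.score_per_dollar, reverse=True)


# Placeholder numbers, not measurements.
runs = [
    RunSummary("setup-a", score=8.0, input_tokens=2_000_000, output_tokens=100_000,
               usd_per_mtok_in=3.0, usd_per_mtok_out=15.0),
    RunSummary("setup-b", score=7.0, input_tokens=1_000_000, output_tokens=50_000,
               usd_per_mtok_in=1.0, usd_per_mtok_out=5.0),
]
for run in rank_by_value(runs):
    print(f"{run.setup}: score={run.score}, cost=${run.cost_usd:.2f}, "
          f"score/$={run.score_per_dollar:.2f}")
```

With these placeholder inputs the slightly lower-scoring setup ranks first because it costs far less. Whether that trade is acceptable depends on how much the lost quality matters to you.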

It runs on your own machine with a subscription you already have. Today it drives Claude Code, the Codex CLI, and the Gemini CLI. Support for Pi, Cursor, and router-based setups is planned.

What NASDE does — in four steps

One nasde run command executes the whole chain.

1. You describe a task you already understand. You provide an instruction, a repo snapshot, and the assessment criteria describing what a good solution looks like. The output can be anything the agent writes into its workspace: code, a migration plan, an ADR, a SQL script, updated docs.
2. The agent solves it in a sandbox. The agent works in a safe, isolated environment, so it can't touch your machine or your real code. Every run starts from the same clean state, so different configurations get a fair comparison. When it's done, a quick test.sh check gives a rough pass/fail signal. Powered by Harbor, it runs locally on Docker or in the cloud.
3. A reviewer agent assesses the result against your criteria. Once the rough tests have run, whether they pass or fail, a second coding agent (claude or codex) navigates the workspace and scores your chosen dimensions (e.g. domain modeling, test quality) on whatever scale you picked. The review stays token-efficient even on large codebases.
4. Results land in a dashboard (optional). Browse scores, compare variants, and track how your agent setup evolves over time, optionally via Opik.
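The steps above produce two kinds of signal per run: a rough pass/fail from test.sh and per-dimension scores from the reviewer. As a rough sketch of how such results could be rolled up per variant, assuming a hypothetical `TrialResult` record (this is not NASDE's actual data model, and the variant and dimension names are invented):

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TrialResult:
    """Hypothetical record of one trial; not NASDE's actual schema."""
    variant: str                    # which agent configuration was run
    tests_passed: bool              # rough signal from the test.sh check
    scores: dict[str, float] = field(default_factory=dict)  # dimension -> reviewer score


def summarize(trials: list[TrialResult]) -> dict[str, dict[str, float]]:
    """Per variant: test pass rate plus the mean reviewer score for each dimension."""
    summary: dict[str, dict[str, float]] = {}
    for variant in sorted({t.variant for t in trials}):
        group = [t for t in trials if t.variant == variant]
        row = {"pass_rate": mean(1.0 if t.tests_passed else 0.0 for t in group)}
        for dim in sorted({d for t in group for d in t.scores}):
            row[dim] = mean(t.scores[dim] for t in group if dim in t.scores)
        summary[variant] = row
    return summary


# Placeholder trials, not real results.
trials = [
    TrialResult("baseline", True, {"domain_modeling": 3.0, "test_quality": 2.0}),
    TrialResult("baseline", False, {"domain_modeling": 2.0, "test_quality": 2.0}),
    TrialResult("with-skill", True, {"domain_modeling": 4.0, "test_quality": 3.0}),
]
for variant, row in summarize(trials).items():
    print(variant, row)
```

Keeping the pass rate separate from the reviewer scores mirrors the idea that a solution can pass the rough tests and still score poorly on the dimensions you care about, or the reverse.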

You're the one defining "what good looks like." NASDE just automates running the experiment and assessing it the same way every time.

📖 Documentation

Full documentation lives at → noesisvision.github.io/nasde-toolkit

Concepts (how the scoring works, the evaluation pipeline, token & cost, rubric calibration), the complete CLI reference, every configuration-file format, authentication, and step-by-step guides — all there, searchable.

Quick Start — zero to a working benchmark from your own git history
How it works — the two independent kinds of scoring, end to end
A real task, end to end — instruction, criteria, and scores
CLI reference — the full command reference
Use Cases · Benchmark Results — worked examples with numbers

What do I use it for?

The core use is a cost-and-quality decision about your AI coding stack: which agent, which model, which provider, which configuration — for our codebase and our budget? NASDE answers it with numbers instead of vibes. Typical things you'd do with it:

Compare providers and models on quality and cost — Claude Code vs. Codex vs. Gemini, Sonnet vs. Opus, against your tasks; see the score and the tokens and dollars each one spends, and pick the best quality-per-dollar for your budget.
Decide whether a migration pays off — before standardizing on a new agent or model, measure what actually changes in output quality and in spend.
Measure your whole harness, not just one skill — run your real CLAUDE.md + skills + MCP servers as a unit and see how the full configuration performs.
Tune a single skill or config — baseline vs. "with my new skill"; see whether it moves the score up or down, and on which dimensions.
Build a regression suite for your AI setup — re-run the task set whenever someone tweaks the prompt/skills/MCP/model and catch quality or cost regressions before they ship.
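A regression check of the kind described in the last item can be as simple as comparing two sets of numbers. The sketch below assumes you have already extracted per-metric values for a baseline run and a candidate run into plain dictionaries. The metric names, the `cost_usd` key, and the tolerance are illustrative assumptions, not something NASDE emits in this shape.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
    lower_is_better: frozenset[str] = frozenset({"cost_usd"}),
) -> dict[str, tuple[float, float]]:
    """Return metrics where the candidate is worse than the baseline by more than `tolerance`.

    Scores regress when they drop; metrics listed in `lower_is_better`
    (cost, by assumption) regress when they rise.
    """
    regressions: dict[str, tuple[float, float]] = {}
    for name, old in baseline.items():
        if name not in candidate:
            continue  # metric missing from the candidate run: nothing to compare
        new = candidate[name]
        worse_by = (new - old) if name in lower_is_better else (old - new)
        if worse_by > tolerance:
            regressions[name] = (old, new)
    return regressions


# Placeholder values, not measurements.
baseline = {"domain_modeling": 4.0, "test_quality": 3.5, "cost_usd": 1.20}
candidate = {"domain_modeling": 4.2, "test_quality": 3.0, "cost_usd": 1.50}
print(find_regressions(baseline, candidate, tolerance=0.25))
```

A single shared tolerance is a simplification: scores and dollars are on different scales, so in practice you would likely want one threshold per metric.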

Quick start

The fastest path from zero to a working benchmark built from your own git history:

# 1. Install the CLI
uv tool install nasde-toolkit --python 3.13
nasde --version

# 2. Install the authoring skills for Claude Code
nasde install-skills

Python version: we recommend --python 3.13 (3.12 is also supported). Python 3.14 is not yet supported — a transitive dependency hasn't released cp314 wheels.
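If you want to check an interpreter before installing, a small guard like the following would do. It assumes that 3.12 and 3.13 are the only supported minor versions, which the note above implies but does not state outright. The `SUPPORTED` set and `is_supported` helper are illustrative and not part of NASDE.

```python
import sys

# Assumption: only these minor versions are supported, per the note above.
SUPPORTED = {(3, 12), (3, 13)}


def is_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """True if the given (major, minor) version is in the assumed supported set."""
    return version in SUPPORTED


if not is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
          "is outside the assumed supported range (3.12 or 3.13).")
```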

Then, from inside your own repo, ask Claude Code:

"Create a NASDE benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR."

The nasde-benchmark-from-history skill proposes a good candidate and scaffolds the task files for you to review. Then run it:

nasde run --all-variants -C path/to/generated-benchmark

Start small — one task is enough to validate the loop end to end. Your existing claude / codex / gemini CLI auth covers it (a Claude Max or ChatGPT Plus subscription is enough). API keys work too.

→ Full walkthrough: Quick Start · Authentication & Opik

Authoring helpers (Claude Code skills)

Writing assessment_criteria.md, picking tasks from git history, and scaffolding Dockerfiles is the tedious part of building a benchmark. NASDE ships Claude Code skills that take care of most of it — install them with nasde install-skills:

Skill	What it does
nasde-benchmark-creator	Interactive end-to-end scaffolding: project layout, tasks, Dockerfiles, test scripts, assessment criteria.
nasde-benchmark-from-history	Point it at a commit range, a merged PR, or a closed issue from your own repo — it proposes tasks based on work your team already finished, and writes the task files for you to review.
nasde-benchmark-from-public-repos	Describe a skill you want to test broadly; it builds a diversity matrix of public repos (languages, sizes, styles) and scaffolds one task per cell.
nasde-benchmark-runner	Guides running benchmarks, re-running the reviewer on existing results, verifying the experiment tracker, and troubleshooting failed runs.
nasde-benchmark-calibration	Publishes trial diffs + scores as PRs/MRs, pulls your review comments back, and proposes concrete rubric edits — the human-in-the-loop calibration loop.

You don't have to use these — everything they do is just writing files you could write by hand — but they save a lot of typing.

Architecture

See ARCHITECTURE.md for the full system architecture with diagrams, and docs/adr/ for architectural decision records. Release notes live in CHANGELOG.md.

Key design: nasde is a thin integration layer over Harbor and Opik, not a replacement. The core flow uses their Python APIs directly; utility commands pass through to their CLIs unchanged.

Community

Have questions, want to share your benchmarks, or discuss AI agent evaluation strategies? Join our Discord community — we'd love to hear from you!

Security

Found a security issue? Please report it privately — see SECURITY.md for the reporting channels, response timeline, and what's in scope.

Project details

Release history

0.5.0 (this version): Jun 24, 2026
0.4.0: May 21, 2026
0.3.3: May 9, 2026
0.3.2: May 8, 2026

Download files

Source Distribution

nasde_toolkit-0.5.0.tar.gz (5.9 MB)

Uploaded Jun 24, 2026 Source

Built Distribution

nasde_toolkit-0.5.0-py3-none-any.whl (145.5 kB)

Uploaded Jun 24, 2026 Python 3

File details

Details for the file nasde_toolkit-0.5.0.tar.gz.

File metadata

Download URL: nasde_toolkit-0.5.0.tar.gz
Upload date: Jun 24, 2026
Size: 5.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nasde_toolkit-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`01f90b9fe52a6573c317d89747c72b48864889856118ec3448f27961778fc874`
MD5	`fd4dd80c2ae0d860aeeb0991d2acd770`
BLAKE2b-256	`ba43dafe7f42ab2d953e3800bb7ed81b7ebc8825c5228d16ee3f9f7a42e65b68`
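A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib`. This is a minimal sketch: the helper names are illustrative, and it assumes the archive has been saved locally under its published file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for nasde_toolkit-0.5.0.tar.gz.
EXPECTED_SHA256 = "01f90b9fe52a6573c317d89747c72b48864889856118ec3448f27961778fc874"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected.lower()


# Usage, assuming the archive sits in the current directory:
# print(matches_published_digest(Path("nasde_toolkit-0.5.0.tar.gz")))
```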

Provenance

The following attestation bundles were made for nasde_toolkit-0.5.0.tar.gz:

Publisher: publish.yml on NoesisVision/nasde-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nasde_toolkit-0.5.0.tar.gz
- Subject digest: 01f90b9fe52a6573c317d89747c72b48864889856118ec3448f27961778fc874
- Sigstore transparency entry: 1937971344
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: NoesisVision/nasde-toolkit@b8104ad8d4d8c51a326eaae7661d758d00ca1ddd
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/NoesisVision
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b8104ad8d4d8c51a326eaae7661d758d00ca1ddd
- Trigger Event: push

File details

Details for the file nasde_toolkit-0.5.0-py3-none-any.whl.

File metadata

Download URL: nasde_toolkit-0.5.0-py3-none-any.whl
Upload date: Jun 24, 2026
Size: 145.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nasde_toolkit-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`deacd3954e46acad142c29a1b947c332adf350212e973a47fb5d102b220ef06e`
MD5	`fb9c89585e8ab6a8b9d0cfe1c2560106`
BLAKE2b-256	`12a3a18873adf09cc300e47629b6aada3bf969400018bd051e93363338721c49`

Provenance

The following attestation bundles were made for nasde_toolkit-0.5.0-py3-none-any.whl:

Publisher: publish.yml on NoesisVision/nasde-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nasde_toolkit-0.5.0-py3-none-any.whl
- Subject digest: deacd3954e46acad142c29a1b947c332adf350212e973a47fb5d102b220ef06e
- Sigstore transparency entry: 1937971478
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: NoesisVision/nasde-toolkit@b8104ad8d4d8c51a326eaae7661d758d00ca1ddd
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/NoesisVision
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b8104ad8d4d8c51a326eaae7661d758d00ca1ddd
- Trigger Event: push
