Sage Evaluator

CLI tool for validating, benchmarking, and optimizing Sage agent configurations.

Why This Tool Exists

Building effective AI agents is iterative. You write a system prompt, choose a model, wire up tools, and hope for the best. But how do you know if your configuration is correct? How do you pick between GPT-4o and Claude when both "seem fine"? How do you catch the configuration mistake that makes your agent silently worse?

Sage Evaluator exists to bring rigor to that process. It provides four capabilities that address the core challenges of agent development:

Validation -- Catch configuration errors before they reach production. Typos in tool names, missing frontmatter fields, unreachable subagent paths, and suspicious defaults are surfaced immediately instead of manifesting as mysterious runtime failures.
Benchmarking -- Compare models objectively. Rather than gut-checking outputs by hand, run the same agent intent across multiple models and get back token usage, latency, cost estimates, and LLM-as-judge quality scores in a single report.
Suggestion -- Get actionable feedback on your agent configuration. The analyzer identifies prompt improvements, opportunities to extract logic into tools, guardrail candidates, and architectural changes -- then optionally generates the code.
Comparison -- A/B test configuration changes. Run two versions of the same agent against identical conditions and see exactly what changed in quality, cost, and speed.

Requirements

Python 3.10+
uv package manager
Azure credentials (for the discover command and for model access via Azure Cognitive Services)

Installation

uv tool install sage-evaluator

Development Setup

git clone <repository-url>
cd sage-evaluator
make install

This runs uv sync --frozen --group dev and installs pre-commit hooks for linting and commit message validation.

If you need to update dependencies:

make update

Configuration

Create a .env file in the project root (or export these variables):

# Required for Azure model access
AZURE_AI_API_BASE=https://<your-endpoint>.services.ai.azure.com

# Required for the discover command (account name, not full URL)
AZURE_AI_ACCOUNT_NAME=<your-account-name>

# Optional -- used by discover to skip auto-discovery
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>

# Optional -- defaults to azure_ai/claude-opus-4-6
EVALUATOR_MODEL=azure_ai/claude-opus-4-6
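
As a rough illustration of how these variables fit together, the sketch below reads them from an environment mapping and applies the documented default for EVALUATOR_MODEL. The `load_settings` helper and the keys of the dictionary it returns are hypothetical names chosen for this example; they are not part of Sage Evaluator's API.

```python
import os
from typing import Mapping, Optional

# Default documented in the configuration block above.
DEFAULT_EVALUATOR_MODEL = "azure_ai/claude-opus-4-6"


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect evaluator settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    api_base = env.get("AZURE_AI_API_BASE")
    if not api_base:
        raise ValueError("AZURE_AI_API_BASE is required for Azure model access")
    return {
        "api_base": api_base,
        # Only the discover command needs the account name, so it may be absent.
        "account_name": env.get("AZURE_AI_ACCOUNT_NAME"),
        # Optional: when missing, discover falls back to auto-discovery.
        "subscription_id": env.get("AZURE_SUBSCRIPTION_ID"),
        "resource_group": env.get("AZURE_RESOURCE_GROUP"),
        "evaluator_model": env.get("EVALUATOR_MODEL", DEFAULT_EVALUATOR_MODEL),
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in tests.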

Commands

`evaluate validate`

Validates agent and skill markdown configuration files. Runs three levels of checks:

Structural -- YAML frontmatter parsing and Pydantic model validation
Semantic -- Model identifier format, tool name verification against known built-ins, subagent path resolution
Best-practice -- Heuristic warnings (default max_turns, short prompt bodies, missing tools)

# Validate a single file
evaluate validate ./my-agent/AGENTS.md

# Validate a directory (looks for AGENTS.md inside)
evaluate validate ./my-agent

# Validate multiple paths
evaluate validate ./agent-a ./agent-b ./shared-skill.md

# Strict mode: treat warnings as errors
evaluate validate ./my-agent --strict

# JSON output (for CI pipelines)
evaluate validate ./my-agent --format json

Exit codes: 0 if all files pass, 1 if any file has errors. With --strict, warnings count as errors.
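
Because the result is reported through the exit code, a script can gate a pipeline step on it without parsing any output. The following is a minimal sketch of that idea; `config_is_valid` is a hypothetical helper name, and the stand-in commands exist only so the example runs where the `evaluate` executable is not installed.

```python
import subprocess
import sys
from typing import Sequence


def config_is_valid(command: Sequence[str]) -> bool:
    """Run a validation command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# In a real pipeline the command would be something like:
#   ["evaluate", "validate", "./my-agent", "--strict"]
# Stand-in commands that exit with 0 and 1, mirroring the documented exit codes.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(config_is_valid(passing), config_is_valid(failing))
```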

`evaluate discover`

Lists models deployed in an Azure Cognitive Services account. Useful for seeing what's available before benchmarking. Subscription and resource group are auto-discovered from the account name unless explicitly provided.

# List deployed models (auto-discovers subscription and resource group)
evaluate discover --account-name my-aisvcs-account

# Include per-token pricing
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing

# Skip auto-discovery by providing subscription and resource group
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME \
  --subscription $AZURE_SUBSCRIPTION_ID \
  --resource-group $AZURE_RESOURCE_GROUP

# JSON output
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing --format json

Option	Default	Description
`--account-name`	(required)	Azure Cognitive Services account name (or `AZURE_AI_ACCOUNT_NAME` env var)
`--subscription`	(auto-discovered)	Azure subscription ID (or `AZURE_SUBSCRIPTION_ID` env var)
`--resource-group`	(auto-discovered)	Azure resource group (or `AZURE_RESOURCE_GROUP` env var)
`--include-pricing`	`false`	Enrich each model with per-token pricing
`--format`	`text`	Output format: `text` or `json`

Pricing is resolved through a 3-tier lookup: litellm data first, then a hard-coded fallback table for common models. If neither has an entry, the price is reported as unknown.
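
The tiered lookup can be pictured as an ordered fallback chain. This is a sketch under assumptions, not the code in discovery/pricing.py: the function name, the `(input, output)` price tuple, and the tier labels are all invented for illustration, and the two tables are passed in as plain mappings rather than read from litellm.

```python
from typing import Mapping, Optional, Tuple

# (input price, output price) per token; the unit is an assumption of this sketch.
Price = Tuple[float, float]


def resolve_price(
    model: str,
    primary: Mapping[str, Price],
    fallback: Mapping[str, Price],
) -> Tuple[Optional[Price], str]:
    """Return (price, source), trying each tier in order."""
    if model in primary:  # tier 1: the primary pricing data
        return primary[model], "primary"
    if model in fallback:  # tier 2: hard-coded table for common models
        return fallback[model], "fallback"
    return None, "unknown"  # tier 3: no price available
```

Returning the source alongside the price lets a report show where each figure came from, which matters when a fallback table may lag behind current pricing.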

`evaluate benchmark`

Benchmarks an agent configuration across one or more models. This is the core workflow:

Elaborates the user intent via LLM (clarifying expected outcomes and evaluation criteria)
Runs the agent with each specified model (multiple times if --runs > 1)
Captures metrics: token usage, latency, tool calls
Scores outputs using an LLM-as-judge against a rubric
Estimates costs using pricing data
Ranks models by weighted composite score

# Basic benchmark
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python programming"

# Multiple runs with code generation rubric
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Generate a REST API for a todo app" \
  --rubric code_generation \
  --runs 3

# Skip quality scoring (metrics only)
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Summarize a document" \
  --no-judge

# Save report to file
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Debug a failing test" \
  --output report.json

Option	Default	Description
`--models`, `-m`	(required)	Model identifiers to benchmark (repeatable)
`--intent`	(required)	User intent to benchmark against
`--rubric`	`default`	Built-in name or path to YAML rubric
`--evaluator-model`	`azure_ai/claude-opus-4-6`	Model for intent elaboration and judging
`--runs`	`1`	Number of runs per model
`--no-judge`	`false`	Skip LLM-as-judge evaluation
`--account-name`	(none)	Azure Cognitive Services account name for model discovery
`--subscription`	(none)	Azure subscription ID (used with `--account-name`)
`--resource-group`	(none)	Azure resource group (used with `--account-name`)
`--output`	(none)	Save report to JSON file
`--format`	`text`	Output format: `text` or `json`
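
When --runs is greater than 1, the per-run metrics have to be combined into one figure per model. The sketch below assumes a simple arithmetic mean over runs; the `RunMetrics` record, its field names, and the `aggregate` function are hypothetical and may not match what benchmark/collector.py does.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RunMetrics:
    """Metrics captured for a single run (hypothetical record)."""

    total_tokens: int
    latency_seconds: float
    tool_calls: int


def aggregate(runs: List[RunMetrics]) -> dict:
    """Average each metric across runs; averaging is an assumption of this sketch."""
    if not runs:
        raise ValueError("at least one run is required")
    return {
        "runs": len(runs),
        "mean_total_tokens": mean(r.total_tokens for r in runs),
        "mean_latency_seconds": mean(r.latency_seconds for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
    }
```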

`evaluate suggest`

Analyzes an agent configuration and returns optimization suggestions across four categories:

Prompt improvement -- Wording, structure, and clarity changes
Tool extraction -- Logic that should be moved from the prompt into @tool functions
Guardrail -- Input/output validation that should be enforced programmatically
Architecture -- Structural changes (subagent decomposition, model selection, etc.)

# Analyze and get suggestions
evaluate suggest ./my-agent/AGENTS.md

# Generate @tool function code from suggestions
evaluate suggest ./my-agent --generate-tools

# Generate guardrail validation functions
evaluate suggest ./my-agent --generate-guardrails

# Both, with JSON output
evaluate suggest ./my-agent \
  --generate-tools --generate-guardrails \
  --format json --output suggestions.json

Option	Default	Description
`--analyzer-model`	`azure_ai/claude-opus-4-6`	Model for analysis
`--generate-tools`	`false`	Generate `@tool` function code
`--generate-guardrails`	`false`	Generate guardrail validation functions
`--output`	(none)	Save report to JSON file
`--format`	`text`	Output format: `text` or `json`
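
To give a sense of what a guardrail validation function can look like, here is a hand-written sketch of a deterministic output check with no LLM calls. It is not output from --generate-guardrails; the function name, the length limit, and the key pattern are illustrative assumptions.

```python
import re
from typing import List

# Illustrative pattern only; real secret formats vary.
_API_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")


def validate_agent_output(text: str, max_chars: int = 4000) -> List[str]:
    """Return a list of violations; an empty list means the output passed."""
    violations: List[str] = []
    if not text.strip():
        violations.append("output is empty")
    if len(text) > max_chars:
        violations.append(f"output exceeds {max_chars} characters")
    if _API_KEY_PATTERN.search(text):
        violations.append("output appears to contain an API key")
    return violations
```

Returning every violation, rather than stopping at the first, gives the caller a complete picture in one pass.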

`evaluate compare`

Runs two agent configurations through the same benchmark and produces a side-by-side comparison. Useful for A/B testing configuration changes.

evaluate compare ./agent-v1/AGENTS.md ./agent-v2/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python" \
  --output comparison.json

Accepts the same options as benchmark (--models, --intent, --rubric, --evaluator-model, --runs, --no-judge, --account-name, --subscription, --resource-group, --output, --format).
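
Conceptually, a side-by-side comparison reduces to per-metric differences between the two configurations. The sketch below assumes each configuration has already been summarized as a flat mapping of metric name to number; the metric names and the `diff_metrics` helper are hypothetical and do not describe the structure of comparison.json.

```python
from typing import Dict


def diff_metrics(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Return candidate minus baseline for every metric present in both."""
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}
```

For example, `diff_metrics({"quality": 7.0, "latency_seconds": 4.0}, {"quality": 8.0, "latency_seconds": 3.0})` yields a positive quality delta and a negative latency delta, meaning the candidate scored higher and responded faster.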

Evaluation Rubrics

The benchmark command scores agent outputs using an LLM-as-judge against a rubric. Three rubrics are built in:

default -- General-purpose evaluation:

Dimension	Weight	Description
relevance	2.0	How well the output addresses the user's intent
accuracy	2.0	Factual and technical correctness
completeness	1.5	Whether all aspects of the request are covered
clarity	1.0	Structure and readability
efficiency	1.0	Appropriate tool use, no unnecessary steps

code_generation -- For code tasks:

Dimension	Weight	Description
correctness	2.5	Functional correctness and edge case handling
completeness	2.0	All requested functionality implemented
code_quality	1.5	Naming, structure, DRY principles
security	1.5	Avoids common vulnerabilities
documentation	1.0	Comments and docstrings

qa -- For question-answering:

Dimension	Weight	Description
accuracy	2.5	Factual correctness
relevance	2.0	Directly addresses the question
depth	1.5	Thoroughness of explanation
source_usage	1.0	Use of tools and references
conciseness	1.0	Avoids unnecessary verbosity
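
One plausible way to turn per-dimension scores into a single number is a weighted mean over the rubric weights. The sketch below uses the weights of the default rubric from the table above; the weighted-mean formula, the score scale, and the `composite_score` name are assumptions, since the exact formula used by the judge is not stated here.

```python
from typing import Mapping

# Weights copied from the default rubric table.
DEFAULT_WEIGHTS = {
    "relevance": 2.0,
    "accuracy": 2.0,
    "completeness": 1.5,
    "clarity": 1.0,
    "efficiency": 1.0,
}


def composite_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted mean of dimension scores (assumed formula)."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight
```

Under this formula, a model that scores well on relevance and accuracy but poorly elsewhere still ranks fairly high, because those two dimensions carry 4.0 of the 7.5 total weight.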

Custom Rubrics

Create a YAML file and pass it via --rubric path/to/rubric.yaml:

name: my_rubric
description: Custom rubric for my use case
dimensions:
  - name: accuracy
    description: Factual correctness of the response
    weight: 2.0
  - name: tone
    description: Professional and helpful tone
    weight: 1.5
  - name: actionability
    description: Provides clear next steps
    weight: 1.0
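
The shape of a rubric file can be checked with a few typed records. This sketch starts from an already-parsed mapping, because parsing YAML needs a third-party library, and it uses standard-library dataclasses rather than the project's Pydantic models. The class names, the `rubric_from_mapping` function, and the rule that weights must be positive are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass(frozen=True)
class Dimension:
    name: str
    description: str
    weight: float


@dataclass(frozen=True)
class Rubric:
    name: str
    description: str
    dimensions: List[Dimension]


def rubric_from_mapping(data: Mapping[str, Any]) -> Rubric:
    """Build a Rubric from parsed YAML data, rejecting obviously invalid input."""
    dimensions = [
        Dimension(str(d["name"]), str(d["description"]), float(d["weight"]))
        for d in data["dimensions"]
    ]
    if not dimensions:
        raise ValueError("a rubric needs at least one dimension")
    # Assumed rule: a zero or negative weight is treated as a mistake.
    if any(d.weight <= 0 for d in dimensions):
        raise ValueError("dimension weights must be positive")
    return Rubric(str(data["name"]), str(data.get("description", "")), dimensions)
```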

Architecture

sage_evaluator/
├── cli/
│   └── main.py              # Click CLI with 5 commands
├── validation/
│   └── validator.py          # 3-level config validation
├── benchmark/
│   ├── engine.py             # Orchestrates the benchmark pipeline
│   ├── runner.py             # InstrumentedProvider for metrics capture
│   └── collector.py          # Multi-run metric aggregation
├── evaluation/
│   ├── judge.py              # LLM-as-judge scoring
│   └── rubrics.py            # Built-in and YAML rubric loading
├── discovery/
│   ├── azure_models.py       # Azure AI Foundry model discovery
│   └── pricing.py            # 3-tier pricing lookup
├── suggestion/
│   ├── analyzer.py           # Prompt and config analysis
│   ├── tool_generator.py     # @tool function code generation
│   └── guardrail_generator.py  # Guardrail function generation
├── reporting/
│   ├── terminal.py           # Rich terminal output
│   └── json_export.py        # JSON report serialization
├── models.py                 # All Pydantic data models
└── exceptions.py             # Exception hierarchy

Key design decisions:

Async-first -- Benchmark execution, model discovery, and suggestion analysis use asyncio for parallel operations.
Instrumentation via wrapping -- InstrumentedProvider wraps the Sage LiteLLMProvider to capture metrics without modifying the agent runtime.
Deterministic guardrails -- Generated guardrail functions are pure validation logic with no LLM calls at runtime.
Strongly typed -- All data flows through Pydantic models for validation and serialization.
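
The first two decisions can be illustrated together: a wrapper that records latency around an inner provider, driven concurrently with asyncio. This is a sketch of the wrapping pattern only. The `Provider` protocol, `EchoProvider`, `TimingWrapper`, and `run_all` are hypothetical names; the real InstrumentedProvider and the Sage LiteLLMProvider interface are not shown here and may differ.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import List, Protocol


class Provider(Protocol):
    """Hypothetical provider interface, not the Sage provider API."""

    async def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider so the sketch runs without network access."""

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return prompt.upper()


@dataclass
class TimingWrapper:
    """Wraps a provider and records the latency of every call."""

    inner: Provider
    latencies: List[float] = field(default_factory=list)

    async def complete(self, prompt: str) -> str:
        start = time.perf_counter()
        try:
            return await self.inner.complete(prompt)
        finally:
            # Recorded even when the inner call raises.
            self.latencies.append(time.perf_counter() - start)


async def run_all(provider: Provider, prompts: List[str]) -> List[str]:
    """Issue all prompts concurrently; gather preserves input order."""
    return list(await asyncio.gather(*(provider.complete(p) for p in prompts)))
```

Because the wrapper exposes the same `complete` method as the provider it wraps, the calling code does not need to know whether instrumentation is active.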

Development

# Run the full quality pipeline (lint + format + type-check + tests)
make test

# Run only tests
make test-only

# Individual checks
make lint          # ruff check with auto-fix
make format        # ruff format
make type-check    # mypy

# Clean build artifacts
make clean

Running Tests

Tests use pytest with pytest-asyncio for async test support and pytest-mock for mocking:

# All tests
uv run pytest -v tests

# Specific test module
uv run pytest -v tests/test_validation/
uv run pytest -v tests/test_benchmark/

# Single test
uv run pytest -v tests/test_cli/test_validate.py -k "test_validate_strict"

CI/CD

The project uses GitHub Actions for continuous integration and release management:

CI (ci.yml) -- Runs make test (lint, format, type-check, pytest) on pull requests to main.
Release (release.yml) -- On push to main or dev, runs Python Semantic Release to version, tag, and publish to PyPI. Pushes to dev produce release candidates.
