Sage Evaluator

CLI tool for validating, benchmarking, and optimizing Sage agent configurations.

Why This Tool Exists

Building effective AI agents is iterative. You write a system prompt, choose a model, wire up tools, and hope for the best. But how do you know if your configuration is correct? How do you pick between GPT-4o and Claude when both "seem fine"? How do you catch the configuration mistake that makes your agent silently worse?

Sage Evaluator exists to bring rigor to that process. It provides four capabilities that address the core challenges of agent development:

  1. Validation -- Catch configuration errors before they reach production. Invalid permissions, malformed extension paths, missing frontmatter fields, unreachable subagent paths, and suspicious defaults are surfaced immediately instead of manifesting as mysterious runtime failures.

  2. Benchmarking -- Compare models objectively. Rather than gut-checking outputs by hand, run the same agent intent across multiple models and get back token usage, latency, cost estimates, and LLM-as-judge quality scores in a single report.

  3. Suggestion -- Get actionable feedback on your agent configuration. The analyzer identifies prompt improvements, opportunities to extract logic into tools, guardrail candidates, and architectural changes -- then optionally generates the code.

  4. Comparison -- A/B test configuration changes. Run two versions of the same agent against identical conditions and see exactly what changed in quality, cost, and speed.

Requirements

  • Python 3.10+
  • uv package manager
  • Azure credentials (for discover and model access via Azure Cognitive Services)

Installation

uv tool install sage-evaluator

Development Setup

git clone <repository-url>
cd sage-evaluator
make install

This runs uv sync --frozen --group dev and installs pre-commit hooks for linting and commit message validation.

If you need to update dependencies:

make update

Configuration

Create a .env file in the project root (or export these variables):

# Required for Azure model access
AZURE_AI_API_BASE=https://<your-endpoint>.services.ai.azure.com

# Required for the discover command (account name, not full URL)
AZURE_AI_ACCOUNT_NAME=<your-account-name>

# Optional -- used by discover to skip auto-discovery
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>

# Optional -- defaults to azure_ai/claude-opus-4-6
EVALUATOR_MODEL=azure_ai/claude-opus-4-6
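
As a rough illustration of how these variables fit together, the sketch below reads them from a mapping and applies the documented default for EVALUATOR_MODEL. The `EvaluatorSettings` class and `load_settings` function are hypothetical names used for illustration only; they are not part of the sage-evaluator API, and the tool's own loading logic may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default documented in the configuration block above.
DEFAULT_EVALUATOR_MODEL = "azure_ai/claude-opus-4-6"


@dataclass(frozen=True)
class EvaluatorSettings:  # hypothetical container, not the package's own model
    api_base: str
    account_name: Optional[str]
    subscription_id: Optional[str]
    resource_group: Optional[str]
    evaluator_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> EvaluatorSettings:
    """Read the documented variables, failing early if the API base is missing."""
    api_base = env.get("AZURE_AI_API_BASE", "").strip()
    if not api_base:
        raise ValueError("AZURE_AI_API_BASE is required for Azure model access")
    return EvaluatorSettings(
        api_base=api_base,
        # Empty strings are treated the same as unset variables.
        account_name=env.get("AZURE_AI_ACCOUNT_NAME") or None,
        subscription_id=env.get("AZURE_SUBSCRIPTION_ID") or None,
        resource_group=env.get("AZURE_RESOURCE_GROUP") or None,
        evaluator_model=env.get("EVALUATOR_MODEL") or DEFAULT_EVALUATOR_MODEL,
    )
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dict instead of the real process environment.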

Commands

evaluate validate

Validates agent and skill markdown configuration files. Runs three levels of checks:

  • Structural -- YAML frontmatter parsing and Pydantic model validation
  • Semantic -- Model identifier format, extension module path verification, permission value validation, subagent path resolution
  • Best-practice -- Heuristic warnings (default max_turns, short prompt bodies, missing permissions or extensions)

# Validate a single file
evaluate validate ./my-agent/AGENTS.md

# Validate a directory (looks for AGENTS.md inside)
evaluate validate ./my-agent

# Validate multiple paths
evaluate validate ./agent-a ./agent-b ./shared-skill.md

# Strict mode: treat warnings as errors
evaluate validate ./my-agent --strict

# JSON output (for CI pipelines)
evaluate validate ./my-agent --format json

Exit codes: 0 if all files pass, 1 if any file has errors.
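
The interaction between --strict and the exit code can be sketched as follows. This is a minimal model of the behavior described above, not the tool's implementation: the `Finding` record and `exit_code` function are hypothetical names, and the severity labels "error" and "warning" are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:  # hypothetical shape of a single validation result
    path: str
    severity: str  # assumed to be "error" or "warning"
    message: str


def exit_code(findings: Iterable[Finding], strict: bool = False) -> int:
    """Return 1 if any finding counts as an error, otherwise 0.

    In strict mode warnings are promoted to errors, mirroring --strict.
    """
    failing = {"error", "warning"} if strict else {"error"}
    return 1 if any(f.severity in failing for f in findings) else 0
```

With this model, a file that only triggers best-practice warnings passes by default and fails once strict mode is enabled, which is the behavior a CI pipeline would typically rely on.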

evaluate discover

Lists models deployed in an Azure Cognitive Services account. Useful for seeing what's available before benchmarking. Subscription and resource group are auto-discovered from the account name unless explicitly provided.

# List deployed models (auto-discovers subscription and resource group)
evaluate discover --account-name my-aisvcs-account

# Include per-token pricing
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing

# Skip auto-discovery by providing subscription and resource group
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME \
  --subscription $AZURE_SUBSCRIPTION_ID \
  --resource-group $AZURE_RESOURCE_GROUP

# JSON output
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing --format json

Option             Default            Description
--account-name     (required)         Azure Cognitive Services account name (or AZURE_AI_ACCOUNT_NAME env var)
--subscription     (auto-discovered)  Azure subscription ID (or AZURE_SUBSCRIPTION_ID env var)
--resource-group   (auto-discovered)  Azure resource group (or AZURE_RESOURCE_GROUP env var)
--include-pricing  false              Enrich each model with per-token pricing
--format           text               Output format: text or json

Pricing is resolved through a three-tier lookup: litellm pricing data is consulted first, then a hard-coded fallback table for common models. If neither has an entry, the price is reported as unknown.
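
The tiered lookup can be pictured as a chain of dictionary checks. The sketch below is illustrative only: `resolve_price` is a hypothetical function, both tables are passed in as plain mappings rather than read from litellm, and the `(input, output)` per-token tuple is an assumed representation.

```python
from typing import Mapping, Optional, Tuple

# Assumed representation: (input, output) cost per token in USD.
Price = Tuple[float, float]


def resolve_price(
    model: str,
    litellm_prices: Mapping[str, Price],
    fallback_prices: Mapping[str, Price],
) -> Tuple[Optional[Price], str]:
    """Return the price and the tier that supplied it.

    Tiers are checked in order: litellm data, the fallback table, then unknown.
    """
    if model in litellm_prices:
        return litellm_prices[model], "litellm"
    if model in fallback_prices:
        return fallback_prices[model], "fallback"
    return None, "unknown"
```

Returning the tier alongside the price makes it possible to show in a report how trustworthy each cost estimate is.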

evaluate benchmark

Benchmarks an agent configuration across one or more models. This is the core workflow:

  1. Elaborates the user intent via LLM (clarifying expected outcomes and evaluation criteria)
  2. Runs the agent with each specified model (multiple times if --runs > 1)
  3. Captures metrics: token usage, latency, tool calls
  4. Scores outputs using an LLM-as-judge against a rubric
  5. Estimates costs using pricing data
  6. Ranks models by weighted composite score

# Basic benchmark
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python programming"

# Multiple runs with code generation rubric
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Generate a REST API for a todo app" \
  --rubric code_generation \
  --runs 3

# Skip quality scoring (metrics only)
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Summarize a document" \
  --no-judge

# Save report to file
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Debug a failing test" \
  --output report.json

Option            Default     Description
--models, -m      (required)  Model identifiers to benchmark (repeatable)
--intent          (required)  User intent to benchmark against
--rubric          default     Built-in name or path to YAML rubric
--runs            1           Number of runs per model
--no-judge        false       Skip LLM-as-judge evaluation
--account-name    (none)      Azure Cognitive Services account name for model discovery
--subscription    (none)      Azure subscription ID (used with --account-name)
--resource-group  (none)      Azure resource group (used with --account-name)
--output          (none)      Save report to JSON file
--format          text        Output format: text or json
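
When --runs is greater than 1, the per-run metrics have to be combined into one row per model. The sketch below shows one plausible way to do that with simple means; the `RunMetrics` record, its field names, and the choice of statistics are assumptions for illustration, not the behavior of the package's collector.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class RunMetrics:  # hypothetical per-run record
    latency_s: float
    total_tokens: int
    tool_calls: int


def aggregate_runs(runs: Sequence[RunMetrics]) -> Dict[str, float]:
    """Collapse several runs of one model into summary statistics."""
    if not runs:
        raise ValueError("at least one run is required")
    return {
        "runs": len(runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "max_latency_s": max(r.latency_s for r in runs),
        "mean_total_tokens": mean(r.total_tokens for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
    }
```

Reporting the maximum latency next to the mean is a design choice made here to surface outlier runs; a real report might prefer percentiles or standard deviation.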

evaluate suggest

Analyzes an agent configuration and returns optimization suggestions across four categories:

  • Prompt improvement -- Wording, structure, and clarity changes
  • Tool extraction -- Logic that should be moved from the prompt into @tool functions
  • Guardrail -- Input/output validation that should be enforced programmatically
  • Architecture -- Structural changes (subagent decomposition, model selection, etc.)

# Analyze and get suggestions
evaluate suggest ./my-agent/AGENTS.md

# Generate @tool function code from suggestions
evaluate suggest ./my-agent --generate-tools

# Generate guardrail validation functions
evaluate suggest ./my-agent --generate-guardrails

# Both, with JSON output
evaluate suggest ./my-agent \
  --generate-tools --generate-guardrails \
  --format json --output suggestions.json

Option                 Default  Description
--generate-tools       false    Generate @tool function code
--generate-guardrails  false    Generate guardrail validation functions
--output               (none)   Save report to JSON file
--format               text     Output format: text or json
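
This README does not show what --generate-guardrails emits, so the following is only a guess at the general shape of a guardrail in the "programmatic validation" sense used above: a pure function that inspects text and returns a verdict without calling a model. The function name, the email rule, and the `(ok, reason)` return convention are all hypothetical.

```python
import re
from typing import Tuple

# Deliberately simple pattern for illustration; not a full RFC 5322 matcher.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_no_email_in_output(text: str) -> Tuple[bool, str]:
    """Reject agent output that appears to contain an email address."""
    match = _EMAIL.search(text)
    if match:
        return False, f"output contains an email address: {match.group(0)}"
    return True, ""
```

Because the check is plain string matching, it gives the same answer for the same input every time and adds no model latency or cost.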

evaluate compare

Runs two agent configurations through the same benchmark and produces a side-by-side comparison. Useful for A/B testing configuration changes.

evaluate compare ./agent-v1/AGENTS.md ./agent-v2/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python" \
  --output comparison.json

Accepts the same options as benchmark (--models, --intent, --rubric, --runs, --no-judge, --account-name, --subscription, --resource-group, --output, --format).
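
Conceptually, the comparison reduces to per-metric deltas between two benchmark summaries for the same model. The sketch below assumes a hypothetical `Summary` record holding the three headline metrics; the field names and the sign convention (after minus before) are illustrative, not the format of the comparison report.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Summary:  # hypothetical headline metrics for one config and one model
    quality: float
    cost_usd: float
    latency_s: float


def compare(before: Summary, after: Summary) -> Dict[str, float]:
    """Return after-minus-before deltas for quality, cost, and latency."""
    return {
        "quality_delta": after.quality - before.quality,
        "cost_delta_usd": after.cost_usd - before.cost_usd,
        "latency_delta_s": after.latency_s - before.latency_s,
    }
```

Under this convention a positive quality delta is an improvement, while positive cost or latency deltas are regressions.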

Evaluation Rubrics

The benchmark command scores agent outputs using an LLM-as-judge against a rubric. Three rubrics are built in:

default -- General-purpose evaluation:

Dimension     Weight  Description
relevance     2.0     How well the output addresses the user's intent
accuracy      2.0     Factual and technical correctness
completeness  1.5     Whether all aspects of the request are covered
clarity       1.0     Structure and readability
efficiency    1.0     Appropriate tool use, no unnecessary steps

code_generation -- For code tasks:

Dimension      Weight  Description
correctness    2.5     Functional correctness and edge case handling
completeness   2.0     All requested functionality implemented
code_quality   1.5     Naming, structure, DRY principles
security       1.5     Avoids common vulnerabilities
documentation  1.0     Comments and docstrings

qa -- For question-answering:

Dimension     Weight  Description
accuracy      2.5     Factual correctness
relevance     2.0     Directly addresses the question
depth         1.5     Thoroughness of explanation
source_usage  1.0     Use of tools and references
conciseness   1.0     Avoids unnecessary verbosity

Custom Rubrics

Create a YAML file and pass it via --rubric path/to/rubric.yaml:

name: my_rubric
description: Custom rubric for my use case
dimensions:
  - name: accuracy
    description: Factual correctness of the response
    weight: 2.0
  - name: tone
    description: Professional and helpful tone
    weight: 1.5
  - name: actionability
    description: Provides clear next steps
    weight: 1.0
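
To show how dimension weights could turn per-dimension judge scores into a single number, the sketch below computes a weight-normalized average. The exact formula and score scale used by the tool are not stated in this README, so both the normalization and the `composite_score` name are assumptions.

```python
from typing import Mapping


def composite_score(weights: Mapping[str, float], scores: Mapping[str, float]) -> float:
    """Weighted average of per-dimension scores, normalized by total weight."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Weights taken from the my_rubric example above; the scores are made up.
my_rubric = {"accuracy": 2.0, "tone": 1.5, "actionability": 1.0}
example = composite_score(my_rubric, {"accuracy": 8, "tone": 6, "actionability": 4})
```

With these made-up scores the result is 29 / 4.5, roughly 6.44: accuracy moves the composite more than actionability because its weight is twice as large.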

Architecture

skills/                         # Sage agent skills
├── create-sage-agent/
│   └── SKILL.md                # Agent scaffolding skill
└── evaluate-sage-agent/
    └── SKILL.md                # Agent evaluation skill
sage_evaluator/
├── cli/
│   └── main.py              # Click CLI with 5 commands
├── validation/
│   └── validator.py          # 3-level config validation
├── benchmark/
│   ├── engine.py             # Orchestrates the benchmark pipeline
│   ├── runner.py             # InstrumentedProvider for metrics capture
│   └── collector.py          # Multi-run metric aggregation
├── evaluation/
│   ├── judge.py              # LLM-as-judge scoring
│   └── rubrics.py            # Built-in and YAML rubric loading
├── discovery/
│   ├── azure_models.py       # Azure AI Foundry model discovery
│   └── pricing.py            # 3-tier pricing lookup
├── suggestion/
│   ├── analyzer.py           # Prompt and config analysis
│   ├── tool_generator.py     # @tool function code generation
│   └── guardrail_generator.py  # Guardrail function generation
├── reporting/
│   ├── terminal.py           # Rich terminal output
│   └── json_export.py        # JSON report serialization
├── models.py                 # All Pydantic data models
└── exceptions.py             # Exception hierarchy

Key design decisions:

  • Async-first -- Benchmark execution, model discovery, and suggestion analysis use asyncio for parallel operations.
  • Instrumentation via wrapping -- InstrumentedProvider wraps the Sage LiteLLMProvider to capture metrics without modifying the agent runtime.
  • Deterministic guardrails -- Generated guardrail functions are pure validation logic with no LLM calls at runtime.
  • Strongly typed -- All data flows through Pydantic models for validation and serialization.
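
The "instrumentation via wrapping" idea can be illustrated with a small async wrapper that times each call and delegates everything else. This is a sketch of the pattern only: the `Provider` protocol, its `complete` method, and the `TimingProvider` and `EchoProvider` classes are hypothetical, and the real InstrumentedProvider and LiteLLMProvider interfaces may look quite different.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import List, Protocol


class Provider(Protocol):  # hypothetical interface, not the Sage provider API
    async def complete(self, prompt: str) -> str: ...


@dataclass
class TimingProvider:
    """Wraps another provider and records the latency of every call."""

    inner: Provider
    latencies_s: List[float] = field(default_factory=list)

    async def complete(self, prompt: str) -> str:
        start = time.perf_counter()
        try:
            return await self.inner.complete(prompt)
        finally:
            # Recorded even if the wrapped call raises.
            self.latencies_s.append(time.perf_counter() - start)


class EchoProvider:
    """Stand-in provider used only to exercise the wrapper."""

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)
        return prompt.upper()
```

A call such as `asyncio.run(TimingProvider(EchoProvider()).complete("hi"))` goes through the wrapper unchanged from the caller's point of view, which is why the wrapped runtime needs no modification.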

Skills

Sage Evaluator ships with two skills that can be loaded by any Sage agent to automate the evaluation workflow:

create-sage-agent

Scaffolds a new agent from a natural language description. Walks the user through:

  1. Describing what the agent should do
  2. Analyzing the intent to infer permissions, model, max_turns, and subagent opportunities
  3. Generating a complete AGENTS.md with a tailored system prompt
  4. Validating the result
  5. Optionally handing off to evaluate-sage-agent for optimization

evaluate-sage-agent

Runs a full evaluation pipeline on an existing agent configuration:

  1. Validates the config format
  2. Determines the appropriate rubric (code_generation, qa, or default) from the agent's purpose
  3. Generates optimization suggestions grouped by category
  4. Offers to apply suggestions with versioned backups (AGENTS.v1.md, AGENTS.v2.md, etc.)
  5. Benchmarks against the agent's model (with optional model comparison)
  6. Compares before/after if changes were applied
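
The versioned-backup step in the list above could work roughly like the sketch below, which picks the next free AGENTS.vN.md name and copies the current file to it. How the skill actually numbers and writes backups is not documented here; `next_backup_path` and `backup` are hypothetical helpers.

```python
import re
import shutil
from pathlib import Path

_VERSIONED = re.compile(r"^AGENTS\.v(\d+)\.md$")


def next_backup_path(agent_dir: Path) -> Path:
    """Return AGENTS.vN.md where N is one more than the highest existing version."""
    versions = [
        int(m.group(1))
        for p in agent_dir.iterdir()
        if (m := _VERSIONED.match(p.name))
    ]
    return agent_dir / f"AGENTS.v{max(versions, default=0) + 1}.md"


def backup(agent_dir: Path) -> Path:
    """Copy AGENTS.md to the next versioned backup and return the new path."""
    target = next_backup_path(agent_dir)
    shutil.copy2(agent_dir / "AGENTS.md", target)
    return target
```

Numbering from the highest existing version rather than counting files keeps the sequence monotonic even if an older backup has been deleted.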

Using Skills

To use these skills with a Sage agent, point the agent's skills_dir to the evaluator's skills directory:

---
name: my-agent
model: azure_ai/gpt-4o
skills_dir: /path/to/sage-evaluator/skills
---

Or copy the skill files into your agent's own skills directory:

cp -r /path/to/sage-evaluator/skills/create-sage-agent ./my-agent/skills/
cp -r /path/to/sage-evaluator/skills/evaluate-sage-agent ./my-agent/skills/

The skills auto-install sage-evaluator via uv pip install if the evaluate CLI is not available.

Development

# Run the full quality pipeline (lint + format + type-check + tests)
make test

# Run only tests
make test-only

# Individual checks
make lint          # ruff check with auto-fix
make format        # ruff format
make type-check    # mypy

# Clean build artifacts
make clean

Running Tests

Tests use pytest with pytest-asyncio for async test support and pytest-mock for mocking:

# All tests
uv run pytest -v tests

# Specific test module
uv run pytest -v tests/test_validation/
uv run pytest -v tests/test_benchmark/

# Single test
uv run pytest -v tests/test_cli/test_validate.py -k "test_validate_strict"

CI/CD

The project uses GitHub Actions for continuous integration and release management:

  • CI (ci.yml) -- Runs make test (lint, format, type-check, pytest) on pull requests to main.
  • Release (release.yml) -- On push to main or dev, runs Python Semantic Release to version, tag, and publish to PyPI. Pushes to dev produce release candidates.
