Sage Evaluator

CLI tool for validating, benchmarking, and optimizing Sage agent configurations.

Why This Tool Exists

Building effective AI agents is iterative. You write a system prompt, choose a model, wire up tools, and hope for the best. But how do you know if your configuration is correct? How do you pick between GPT-4o and Claude when both "seem fine"? How do you catch the configuration mistake that makes your agent silently worse?

Sage Evaluator exists to bring rigor to that process. It provides four capabilities that address the core challenges of agent development:

  1. Validation -- Catch configuration errors before they reach production. Typos in tool names, missing frontmatter fields, unreachable subagent paths, and suspicious defaults are surfaced immediately instead of manifesting as mysterious runtime failures.

  2. Benchmarking -- Compare models objectively. Rather than gut-checking outputs by hand, run the same agent intent across multiple models and get back token usage, latency, cost estimates, and LLM-as-judge quality scores in a single report.

  3. Suggestion -- Get actionable feedback on your agent configuration. The analyzer identifies prompt improvements, opportunities to extract logic into tools, guardrail candidates, and architectural changes -- then optionally generates the code.

  4. Comparison -- A/B test configuration changes. Run two versions of the same agent against identical conditions and see exactly what changed in quality, cost, and speed.

Requirements

  • Python 3.11 or 3.12 (>= 3.11, < 3.13)
  • uv package manager
  • Azure credentials (for discover and model access via Azure Cognitive Services)

Installation

# Clone and install
git clone <repository-url>
cd sage-evaluator
make install

This runs uv sync --frozen --group dev to install all dependencies including dev tools.

If you need to update dependencies:

make update

Configuration

Create a .env file in the project root (or export these variables):

# Required for Azure model access
AZURE_AI_API_BASE=https://<your-endpoint>.services.ai.azure.com

# Required for the discover command (account name, not full URL)
AZURE_AI_ACCOUNT_NAME=<your-account-name>

# Optional -- used by discover to skip auto-discovery
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>

# Optional -- defaults to azure_ai/claude-opus-4-6
EVALUATOR_MODEL=azure_ai/claude-opus-4-6
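
As a rough illustration of how these variables might be consumed, the sketch below reads them from an environment mapping and applies the documented default for EVALUATOR_MODEL. The Settings dataclass and load_settings function are hypothetical names, not part of Sage Evaluator's API, and the sketch assumes the variables are already exported (it does not parse a .env file).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default documented in the Configuration section.
DEFAULT_EVALUATOR_MODEL = "azure_ai/claude-opus-4-6"


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    api_base: Optional[str]
    account_name: Optional[str]
    subscription_id: Optional[str]
    resource_group: Optional[str]
    evaluator_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping (os.environ by default)."""
    return Settings(
        api_base=env.get("AZURE_AI_API_BASE"),
        account_name=env.get("AZURE_AI_ACCOUNT_NAME"),
        subscription_id=env.get("AZURE_SUBSCRIPTION_ID"),
        resource_group=env.get("AZURE_RESOURCE_GROUP"),
        # Fall back to the documented default when the variable is unset.
        evaluator_model=env.get("EVALUATOR_MODEL", DEFAULT_EVALUATOR_MODEL),
    )
```

Passing the mapping as a parameter keeps the function easy to exercise in tests without touching the real process environment.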

Commands

evaluate validate

Validates agent and skill markdown configuration files. Runs three levels of checks:

  • Structural -- YAML frontmatter parsing and Pydantic model validation
  • Semantic -- Model identifier format, tool name verification against known built-ins, subagent path resolution
  • Best-practice -- Heuristic warnings (default max_turns, short prompt bodies, missing tools)

# Validate a single file
evaluate validate ./my-agent/AGENTS.md

# Validate a directory (looks for AGENTS.md inside)
evaluate validate ./my-agent

# Validate multiple paths
evaluate validate ./agent-a ./agent-b ./shared-skill.md

# Strict mode: treat warnings as errors
evaluate validate ./my-agent --strict

# JSON output (for CI pipelines)
evaluate validate ./my-agent --format json

Exit codes: 0 if all files pass, 1 if any file has errors (or warnings, when --strict is set).
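
The exit-code rule can be sketched as a small function. This is a minimal illustration, assuming that --strict promotes warnings to failures as its description suggests; FileResult and exit_code are hypothetical names, not the tool's internals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileResult:
    """Hypothetical per-file validation outcome."""
    path: str
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


def exit_code(results: List[FileResult], strict: bool = False) -> int:
    """Return 0 if every file passes, 1 otherwise.

    In strict mode, a file with warnings is treated as failing.
    """
    for result in results:
        if result.errors or (strict and result.warnings):
            return 1
    return 0
```

Under this sketch, a run where one file only has a warning exits with 0 by default and with 1 when strict is enabled.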

evaluate discover

Lists models deployed in an Azure Cognitive Services account. Useful for seeing what's available before benchmarking. Subscription and resource group are auto-discovered from the account name unless explicitly provided.

# List deployed models (auto-discovers subscription and resource group)
evaluate discover --account-name my-aisvcs-account

# Include per-token pricing
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing

# Skip auto-discovery by providing subscription and resource group
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME \
  --subscription $AZURE_SUBSCRIPTION_ID \
  --resource-group $AZURE_RESOURCE_GROUP

# JSON output
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing --format json

| Option | Default | Description |
| --- | --- | --- |
| --account-name | (required) | Azure Cognitive Services account name (or AZURE_AI_ACCOUNT_NAME env var) |
| --subscription | (auto-discovered) | Azure subscription ID (or AZURE_SUBSCRIPTION_ID env var) |
| --resource-group | (auto-discovered) | Azure resource group (or AZURE_RESOURCE_GROUP env var) |
| --include-pricing | false | Enrich each model with per-token pricing |
| --format | text | Output format: text or json |

Pricing is resolved through a 3-tier lookup: litellm pricing data is tried first, then a hard-coded fallback table for common models; if neither has an entry, the price is reported as unknown.
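
A tiered lookup of this kind might look like the sketch below. The two dictionaries stand in for the litellm data and the fallback table; the model names and prices are placeholders invented for illustration, and resolve_pricing is a hypothetical function rather than the tool's actual implementation.

```python
from typing import Dict, Optional, Tuple

# (input price, output price) per token in USD.
# All entries are placeholders for illustration, not real prices.
Price = Tuple[float, float]

PRIMARY_PRICES: Dict[str, Price] = {"example-model-a": (1e-6, 2e-6)}
FALLBACK_PRICES: Dict[str, Price] = {"example-model-b": (3e-6, 6e-6)}


def resolve_pricing(model: str) -> Tuple[Optional[Price], str]:
    """Return (price, source), trying each tier in order."""
    if model in PRIMARY_PRICES:      # tier 1: primary pricing data
        return PRIMARY_PRICES[model], "primary"
    if model in FALLBACK_PRICES:     # tier 2: hard-coded fallback table
        return FALLBACK_PRICES[model], "fallback"
    return None, "unknown"           # tier 3: no price available
```

Returning the source alongside the price lets a report show where each figure came from, which is useful when a fallback value may be stale.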

evaluate benchmark

Benchmarks an agent configuration across one or more models. This is the core workflow:

  1. Elaborates the user intent via LLM (clarifying expected outcomes and evaluation criteria)
  2. Runs the agent with each specified model (multiple times if --runs > 1)
  3. Captures metrics: token usage, latency, tool calls
  4. Scores outputs using an LLM-as-judge against a rubric
  5. Estimates costs using pricing data
  6. Ranks models by weighted composite score

# Basic benchmark
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python programming"

# Multiple runs with code generation rubric
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Generate a REST API for a todo app" \
  --rubric code_generation \
  --runs 3

# Skip quality scoring (metrics only)
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Summarize a document" \
  --no-judge

# Save report to file
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Debug a failing test" \
  --output report.json

| Option | Default | Description |
| --- | --- | --- |
| --models, -m | (required) | Model identifiers to benchmark (repeatable) |
| --intent | (required) | User intent to benchmark against |
| --rubric | default | Built-in name or path to YAML rubric |
| --evaluator-model | azure_ai/claude-opus-4-6 | Model for intent elaboration and judging |
| --runs | 1 | Number of runs per model |
| --no-judge | false | Skip LLM-as-judge evaluation |
| --account-name | (none) | Azure Cognitive Services account name for model discovery |
| --subscription | (none) | Azure subscription ID (used with --account-name) |
| --resource-group | (none) | Azure resource group (used with --account-name) |
| --output | (none) | Save report to JSON file |
| --format | text | Output format: text or json |
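
When --runs is greater than 1, per-run metrics have to be combined before costs are estimated. The sketch below shows one plausible way to do that, assuming simple averaging across runs and a cost of tokens multiplied by per-token price. RunMetrics, aggregate, and estimate_cost are hypothetical names; the tool's own collector may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RunMetrics:
    """Hypothetical metrics captured for one run of one model."""
    input_tokens: int
    output_tokens: int
    latency_s: float


def aggregate(runs: List[RunMetrics]) -> RunMetrics:
    """Average each metric across runs (token means are rounded to ints)."""
    return RunMetrics(
        input_tokens=round(mean(r.input_tokens for r in runs)),
        output_tokens=round(mean(r.output_tokens for r in runs)),
        latency_s=mean(r.latency_s for r in runs),
    )


def estimate_cost(m: RunMetrics, input_price: float, output_price: float) -> float:
    """Cost = tokens x per-token price, summed over input and output."""
    return m.input_tokens * input_price + m.output_tokens * output_price
```

Averaging smooths out run-to-run variance in latency and token counts, which is the main reason to run each model more than once.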

evaluate suggest

Analyzes an agent configuration and returns optimization suggestions across four categories:

  • Prompt improvement -- Wording, structure, and clarity changes
  • Tool extraction -- Logic that should be moved from the prompt into @tool functions
  • Guardrail -- Input/output validation that should be enforced programmatically
  • Architecture -- Structural changes (subagent decomposition, model selection, etc.)

# Analyze and get suggestions
evaluate suggest ./my-agent/AGENTS.md

# Generate @tool function code from suggestions
evaluate suggest ./my-agent --generate-tools

# Generate guardrail validation functions
evaluate suggest ./my-agent --generate-guardrails

# Both, with JSON output
evaluate suggest ./my-agent \
  --generate-tools --generate-guardrails \
  --format json --output suggestions.json

| Option | Default | Description |
| --- | --- | --- |
| --analyzer-model | azure_ai/claude-opus-4-6 | Model for analysis |
| --generate-tools | false | Generate @tool function code |
| --generate-guardrails | false | Generate guardrail validation functions |
| --output | (none) | Save report to JSON file |
| --format | text | Output format: text or json |
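
For a sense of what a guardrail validation function can look like, here is a hand-written sketch of a deterministic output check. It is not output produced by --generate-guardrails; the function name, the rules, the regular expression, and the length limit are all assumptions chosen for illustration.

```python
import re
from typing import List

MAX_OUTPUT_CHARS = 4000  # hypothetical limit

# Matches strings that look like "api_key = <long token>" (illustrative pattern).
_SECRET_PATTERN = re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S{8,}")


def check_output(text: str) -> List[str]:
    """Return a list of violations; an empty list means the output passes."""
    violations: List[str] = []
    if not text.strip():
        violations.append("output is empty")
    if len(text) > MAX_OUTPUT_CHARS:
        violations.append("output exceeds length limit")
    if _SECRET_PATTERN.search(text):
        violations.append("output appears to contain an API key")
    return violations
```

Because the check is plain string logic with no model call, it returns the same result for the same input every time and adds negligible latency.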

evaluate compare

Runs two agent configurations through the same benchmark and produces a side-by-side comparison. Useful for A/B testing configuration changes.

evaluate compare ./agent-v1/AGENTS.md ./agent-v2/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python" \
  --output comparison.json

Accepts the same options as benchmark (--models, --intent, --rubric, --evaluator-model, --runs, --no-judge, --account-name, --subscription, --resource-group, --output, --format).
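
Conceptually, a side-by-side comparison reduces to per-metric differences between the two configurations. The sketch below illustrates that idea; Summary and diff are hypothetical names, and the fields are an assumed subset of what a real comparison report would contain.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Summary:
    """Hypothetical per-configuration summary for one model."""
    quality: float    # judge score
    cost_usd: float   # estimated cost
    latency_s: float  # wall-clock latency


def diff(baseline: Summary, candidate: Summary) -> Dict[str, float]:
    """Return candidate minus baseline for each metric."""
    return {
        "quality": candidate.quality - baseline.quality,
        "cost_usd": candidate.cost_usd - baseline.cost_usd,
        "latency_s": candidate.latency_s - baseline.latency_s,
    }
```

A positive quality delta with negative cost and latency deltas means the second configuration is better on all three axes; mixed signs indicate a trade-off to weigh.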

Evaluation Rubrics

The benchmark command scores agent outputs using an LLM-as-judge against a rubric. Three rubrics are built in:

default -- General-purpose evaluation:

| Dimension | Weight | Description |
| --- | --- | --- |
| relevance | 2.0 | How well the output addresses the user's intent |
| accuracy | 2.0 | Factual and technical correctness |
| completeness | 1.5 | Whether all aspects of the request are covered |
| clarity | 1.0 | Structure and readability |
| efficiency | 1.0 | Appropriate tool use, no unnecessary steps |

code_generation -- For code tasks:

| Dimension | Weight | Description |
| --- | --- | --- |
| correctness | 2.5 | Functional correctness and edge case handling |
| completeness | 2.0 | All requested functionality implemented |
| code_quality | 1.5 | Naming, structure, DRY principles |
| security | 1.5 | Avoids common vulnerabilities |
| documentation | 1.0 | Comments and docstrings |

qa -- For question-answering:

| Dimension | Weight | Description |
| --- | --- | --- |
| accuracy | 2.5 | Factual correctness |
| relevance | 2.0 | Directly addresses the question |
| depth | 1.5 | Thoroughness of explanation |
| source_usage | 1.0 | Use of tools and references |
| conciseness | 1.0 | Avoids unnecessary verbosity |
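
The weights feed into a composite score per model. The exact formula is not stated here, so the sketch below assumes a weighted mean (sum of score times weight, divided by total weight) and uses the weights of the default rubric. composite_score and rank are hypothetical helper names.

```python
from typing import Dict, List

# Weights copied from the built-in "default" rubric.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "relevance": 2.0,
    "accuracy": 2.0,
    "completeness": 1.5,
    "clarity": 1.0,
    "efficiency": 1.0,
}


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (assumed formula)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def rank(model_scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Order model names from highest to lowest composite score."""
    return sorted(
        model_scores,
        key=lambda m: composite_score(model_scores[m], DEFAULT_WEIGHTS),
        reverse=True,
    )
```

Under this assumption, dimensions with weight 2.0 pull the composite twice as hard as those with weight 1.0, so a model that is strong on relevance and accuracy can outrank one that is merely clearer.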

Custom Rubrics

Create a YAML file and pass it via --rubric path/to/rubric.yaml:

name: my_rubric
description: Custom rubric for my use case
dimensions:
  - name: accuracy
    description: Factual correctness of the response
    weight: 2.0
  - name: tone
    description: Professional and helpful tone
    weight: 1.5
  - name: actionability
    description: Provides clear next steps
    weight: 1.0
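
Once such a file has been parsed into a mapping (for example with a YAML library), its structure can be checked before use. The sketch below is one way to do that with the standard library only; Dimension and parse_rubric are hypothetical names, and the rules enforced (unique names, positive weights) are assumptions rather than the tool's documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Dimension:
    name: str
    description: str
    weight: float


def parse_rubric(data: Dict[str, Any]) -> List[Dimension]:
    """Validate an already-parsed rubric mapping and return its dimensions."""
    if not data.get("name"):
        raise ValueError("rubric needs a name")
    dims = [
        Dimension(d["name"], d.get("description", ""), float(d["weight"]))
        for d in data.get("dimensions", [])
    ]
    if not dims:
        raise ValueError("rubric needs at least one dimension")
    names = [d.name for d in dims]
    if len(names) != len(set(names)):
        raise ValueError("dimension names must be unique")
    if any(d.weight <= 0 for d in dims):
        raise ValueError("weights must be positive")
    return dims
```

Failing early on a malformed rubric is cheaper than discovering the problem after a full benchmark run has already spent tokens.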

Architecture

sage_evaluator/
├── cli/
│   └── main.py              # Click CLI with 5 commands
├── validation/
│   └── validator.py          # 3-level config validation
├── benchmark/
│   ├── engine.py             # Orchestrates the benchmark pipeline
│   ├── runner.py             # InstrumentedProvider for metrics capture
│   └── collector.py          # Multi-run metric aggregation
├── evaluation/
│   ├── judge.py              # LLM-as-judge scoring
│   └── rubrics.py            # Built-in and YAML rubric loading
├── discovery/
│   ├── azure_models.py       # Azure AI Foundry model discovery
│   └── pricing.py            # 3-tier pricing lookup
├── suggestion/
│   ├── analyzer.py           # Prompt and config analysis
│   ├── tool_generator.py     # @tool function code generation
│   └── guardrail_generator.py  # Guardrail function generation
├── reporting/
│   ├── terminal.py           # Rich terminal output
│   └── json_export.py        # JSON report serialization
├── models.py                 # All Pydantic data models
└── exceptions.py             # Exception hierarchy

Key design decisions:

  • Async-first -- Benchmark execution, model discovery, and suggestion analysis use asyncio for parallel operations.
  • Instrumentation via wrapping -- InstrumentedProvider wraps the Sage LiteLLMProvider to capture metrics without modifying the agent runtime.
  • Deterministic guardrails -- Generated guardrail functions are pure validation logic with no LLM calls at runtime.
  • Strongly typed -- All data flows through Pydantic models for validation and serialization.
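
The first two decisions can be illustrated together: a wrapper records metrics around each provider call, and asyncio runs the calls for several models concurrently. This is a minimal sketch under stated assumptions. FakeProvider, TimingWrapper, and run_all are hypothetical stand-ins, and the complete method is an invented interface, not the API of InstrumentedProvider or LiteLLMProvider.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import List


class FakeProvider:
    """Stand-in for a real LLM provider (hypothetical interface)."""

    def __init__(self, model: str) -> None:
        self.model = model

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # simulate network latency
        return f"[{self.model}] reply to: {prompt}"


@dataclass
class TimingWrapper:
    """Wraps a provider and records per-call latency without modifying it."""
    inner: FakeProvider
    latencies: List[float] = field(default_factory=list)

    async def complete(self, prompt: str) -> str:
        start = time.perf_counter()
        try:
            return await self.inner.complete(prompt)
        finally:
            # Record latency even if the wrapped call raises.
            self.latencies.append(time.perf_counter() - start)


async def run_all(models: List[str], prompt: str) -> List[TimingWrapper]:
    """Call every model concurrently and return the instrumented wrappers."""
    wrappers = [TimingWrapper(FakeProvider(m)) for m in models]
    await asyncio.gather(*(w.complete(prompt) for w in wrappers))
    return wrappers
```

Because the wrapper exposes the same method as the object it wraps, the calling code does not need to know whether it is talking to an instrumented or a plain provider.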

Development

# Run the full quality pipeline (lint + format + type-check + tests)
make test

# Run only tests
make test-only

# Individual checks
make lint          # ruff check with auto-fix
make format        # ruff format
make type-check    # mypy

# Clean build artifacts
make clean

Running Tests

Tests use pytest with pytest-asyncio for async test support and pytest-mock for mocking:

# All tests
uv run pytest -v tests

# Specific test module
uv run pytest -v tests/test_validation/
uv run pytest -v tests/test_benchmark/

# Single test
uv run pytest -v tests/test_cli/test_validate.py -k "test_validate_strict"

Building and Publishing

# Build the package
make build

# Build with a specific version tag
make build TAG=0.2.0

# Publish to Azure Artifacts (requires AZURE_ARTIFACTS_ENV_ACCESS_TOKEN)
make publish
