Sage Evaluator
CLI tool for validating, benchmarking, and optimizing Sage agent configurations.
Why This Tool Exists
Building effective AI agents is iterative. You write a system prompt, choose a model, wire up tools, and hope for the best. But how do you know if your configuration is correct? How do you pick between GPT-4o and Claude when both "seem fine"? How do you catch the configuration mistake that makes your agent silently worse?
Sage Evaluator exists to bring rigor to that process. It provides four capabilities that address the core challenges of agent development:
- Validation -- Catch configuration errors before they reach production. Typos in tool names, missing frontmatter fields, unreachable subagent paths, and suspicious defaults are surfaced immediately instead of manifesting as mysterious runtime failures.
- Benchmarking -- Compare models objectively. Rather than gut-checking outputs by hand, run the same agent intent across multiple models and get back token usage, latency, cost estimates, and LLM-as-judge quality scores in a single report.
- Suggestion -- Get actionable feedback on your agent configuration. The analyzer identifies prompt improvements, opportunities to extract logic into tools, guardrail candidates, and architectural changes -- then optionally generates the code.
- Comparison -- A/B test configuration changes. Run two versions of the same agent against identical conditions and see exactly what changed in quality, cost, and speed.
Requirements
- Python 3.10+
- uv package manager
- Azure credentials (for discover and model access via Azure Cognitive Services)
Installation
uv tool install sage-evaluator
Development Setup
git clone <repository-url>
cd sage-evaluator
make install
This runs uv sync --frozen --group dev and installs pre-commit hooks for linting and commit message validation.
If you need to update dependencies:
make update
Configuration
Create a .env file in the project root (or export these variables):
# Required for Azure model access
AZURE_AI_API_BASE=https://<your-endpoint>.services.ai.azure.com
# Required for the discover command (account name, not full URL)
AZURE_AI_ACCOUNT_NAME=<your-account-name>
# Optional -- used by discover to skip auto-discovery
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>
# Optional -- defaults to azure_ai/claude-opus-4-6
EVALUATOR_MODEL=azure_ai/claude-opus-4-6
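As a minimal sketch of how these variables might be consumed, the snippet below reads them from an environment mapping and applies the documented default for EVALUATOR_MODEL. The Settings dataclass and load_settings helper are hypothetical names chosen for illustration, not part of Sage Evaluator's API, and the sketch assumes the variables are already exported (it does not parse a .env file).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_EVALUATOR_MODEL = "azure_ai/claude-opus-4-6"  # documented default


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""

    api_base: str
    account_name: Optional[str]
    subscription_id: Optional[str]
    resource_group: Optional[str]
    evaluator_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, failing early if the required base URL is missing."""
    api_base = env.get("AZURE_AI_API_BASE")
    if not api_base:
        raise ValueError("AZURE_AI_API_BASE is required for Azure model access")
    return Settings(
        api_base=api_base,
        account_name=env.get("AZURE_AI_ACCOUNT_NAME"),
        subscription_id=env.get("AZURE_SUBSCRIPTION_ID"),
        resource_group=env.get("AZURE_RESOURCE_GROUP"),
        evaluator_model=env.get("EVALUATOR_MODEL", DEFAULT_EVALUATOR_MODEL),
    )
```

Passing the mapping explicitly, rather than reading os.environ inside the function, keeps the helper easy to test.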
Commands
evaluate validate
Validates agent and skill markdown configuration files. Runs three levels of checks:
- Structural -- YAML frontmatter parsing and Pydantic model validation
- Semantic -- Model identifier format, tool name verification against known built-ins, subagent path resolution
- Best-practice -- Heuristic warnings (default max_turns, short prompt bodies, missing tools)
# Validate a single file
evaluate validate ./my-agent/AGENTS.md
# Validate a directory (looks for AGENTS.md inside)
evaluate validate ./my-agent
# Validate multiple paths
evaluate validate ./agent-a ./agent-b ./shared-skill.md
# Strict mode: treat warnings as errors
evaluate validate ./my-agent --strict
# JSON output (for CI pipelines)
evaluate validate ./my-agent --format json
Exit codes: 0 if all files pass, 1 if any file has errors.
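In a CI script, the exit code alone is enough to gate a build. The sketch below shells out to evaluate validate using only the flags shown above; the helper names build_validate_command and validate_paths are hypothetical, and the JSON output is captured but not parsed because its schema is not described here.

```python
import subprocess
from typing import Callable, Sequence


def build_validate_command(paths: Sequence[str], strict: bool = False) -> list[str]:
    """Assemble an `evaluate validate` invocation from the documented flags."""
    cmd = ["evaluate", "validate", *paths, "--format", "json"]
    if strict:
        cmd.append("--strict")  # treat warnings as errors
    return cmd


def validate_paths(
    paths: Sequence[str],
    strict: bool = False,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when the CLI exits with 0 (all files pass), False otherwise."""
    result = run(build_validate_command(paths, strict), capture_output=True, text=True)
    return result.returncode == 0
```

The run parameter exists only so the helper can be exercised without the CLI installed; in a pipeline the default subprocess.run would be used.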
evaluate discover
Lists models deployed in an Azure Cognitive Services account. Useful for seeing what's available before benchmarking. Subscription and resource group are auto-discovered from the account name unless explicitly provided.
# List deployed models (auto-discovers subscription and resource group)
evaluate discover --account-name my-aisvcs-account
# Include per-token pricing
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing
# Skip auto-discovery by providing subscription and resource group
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME \
--subscription $AZURE_SUBSCRIPTION_ID \
--resource-group $AZURE_RESOURCE_GROUP
# JSON output
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing --format json
| Option | Default | Description |
|---|---|---|
| --account-name | (required) | Azure Cognitive Services account name (or AZURE_AI_ACCOUNT_NAME env var) |
| --subscription | (auto-discovered) | Azure subscription ID (or AZURE_SUBSCRIPTION_ID env var) |
| --resource-group | (auto-discovered) | Azure resource group (or AZURE_RESOURCE_GROUP env var) |
| --include-pricing | false | Enrich each model with per-token pricing |
| --format | text | Output format: text or json |
Pricing is resolved through a 3-tier lookup: litellm data is consulted first, then a hard-coded fallback table for common models; if neither has an entry, the price is reported as unknown.
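The tiered lookup can be sketched as a chain of dictionary checks. This is an illustration under stated assumptions, not the code in pricing.py: resolve_pricing, FALLBACK_PRICES, the (input, output) price tuple, and the placeholder values are all hypothetical, and the primary mapping stands in for whatever pricing data litellm provides.

```python
from typing import Mapping, Optional

Price = tuple[float, float]  # assumed shape: USD per (input token, output token)

# Hypothetical fallback table; the entry and its values are placeholders.
FALLBACK_PRICES: dict[str, Price] = {
    "example-model": (1e-6, 2e-6),
}


def resolve_pricing(
    model: str,
    primary: Mapping[str, Price],
    fallback: Mapping[str, Price] = FALLBACK_PRICES,
) -> Optional[Price]:
    """Return per-token prices for `model`, or None when the price is unknown."""
    if model in primary:   # tier 1: primary pricing data
        return primary[model]
    if model in fallback:  # tier 2: hard-coded table
        return fallback[model]
    return None            # tier 3: unknown
```

Returning None for the third tier lets callers distinguish "unknown" from a genuine price of zero.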
evaluate benchmark
Benchmarks an agent configuration across one or more models. This is the core workflow:
- Elaborates the user intent via LLM (clarifying expected outcomes and evaluation criteria)
- Runs the agent with each specified model (multiple times if --runs > 1)
- Captures metrics: token usage, latency, tool calls
- Scores outputs using an LLM-as-judge against a rubric
- Estimates costs using pricing data
- Ranks models by weighted composite score
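To make the metrics and cost steps concrete, here is a small sketch of averaging several runs and estimating a per-run cost from per-token prices. The RunMetrics record, its field names, and summarize_runs are assumptions for illustration; the actual aggregation in the tool may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run record; field names are illustrative."""

    input_tokens: int
    output_tokens: int
    latency_s: float


def summarize_runs(
    runs: list[RunMetrics], input_price: float, output_price: float
) -> dict[str, float]:
    """Average metrics across runs and estimate the mean cost of one run."""
    if not runs:
        raise ValueError("at least one run is required")
    avg_in = mean(r.input_tokens for r in runs)
    avg_out = mean(r.output_tokens for r in runs)
    return {
        "avg_input_tokens": avg_in,
        "avg_output_tokens": avg_out,
        "avg_latency_s": mean(r.latency_s for r in runs),
        # cost = tokens x per-token price, summed over input and output
        "est_cost_usd": avg_in * input_price + avg_out * output_price,
    }
```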
# Basic benchmark
evaluate benchmark ./my-agent/AGENTS.md \
--models gpt-4o azure_ai/claude-opus-4-6 \
--intent "Answer user questions about Python programming"
# Multiple runs with code generation rubric
evaluate benchmark ./my-agent/AGENTS.md \
--models gpt-4o azure_ai/claude-opus-4-6 \
--intent "Generate a REST API for a todo app" \
--rubric code_generation \
--runs 3
# Skip quality scoring (metrics only)
evaluate benchmark ./my-agent/AGENTS.md \
--models gpt-4o \
--intent "Summarize a document" \
--no-judge
# Save report to file
evaluate benchmark ./my-agent/AGENTS.md \
--models gpt-4o \
--intent "Debug a failing test" \
--output report.json
| Option | Default | Description |
|---|---|---|
| --models, -m | (required) | Model identifiers to benchmark (repeatable) |
| --intent | (required) | User intent to benchmark against |
| --rubric | default | Built-in name or path to YAML rubric |
| --evaluator-model | azure_ai/claude-opus-4-6 | Model for intent elaboration and judging |
| --runs | 1 | Number of runs per model |
| --no-judge | false | Skip LLM-as-judge evaluation |
| --account-name | (none) | Azure Cognitive Services account name for model discovery |
| --subscription | (none) | Azure subscription ID (used with --account-name) |
| --resource-group | (none) | Azure resource group (used with --account-name) |
| --output | (none) | Save report to JSON file |
| --format | text | Output format: text or json |
evaluate suggest
Analyzes an agent configuration and returns optimization suggestions across four categories:
- Prompt improvement -- Wording, structure, and clarity changes
- Tool extraction -- Logic that should be moved from the prompt into @tool functions
- Guardrail -- Input/output validation that should be enforced programmatically
- Architecture -- Structural changes (subagent decomposition, model selection, etc.)
# Analyze and get suggestions
evaluate suggest ./my-agent/AGENTS.md
# Generate @tool function code from suggestions
evaluate suggest ./my-agent --generate-tools
# Generate guardrail validation functions
evaluate suggest ./my-agent --generate-guardrails
# Both, with JSON output
evaluate suggest ./my-agent \
--generate-tools --generate-guardrails \
--format json --output suggestions.json
| Option | Default | Description |
|---|---|---|
| --analyzer-model | azure_ai/claude-opus-4-6 | Model for analysis |
| --generate-tools | false | Generate @tool function code |
| --generate-guardrails | false | Generate guardrail validation functions |
| --output | (none) | Save report to JSON file |
| --format | text | Output format: text or json |
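To give a feel for what a guardrail validation function could look like, here is a hand-written sketch of a deterministic output check. It is not actual output from --generate-guardrails: the function name, the rules it enforces, the secret pattern, and the return shape are all assumptions made for illustration.

```python
import re

# Illustrative pattern for something that resembles an API key (an assumption).
_SECRET_PATTERN = re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b")


def check_output(output: str, max_chars: int = 4000) -> list[str]:
    """Return a list of violations; an empty list means the output passes.

    Pure string checks only, so the result is deterministic and needs no LLM call.
    """
    violations: list[str] = []
    if len(output) > max_chars:
        violations.append(f"output exceeds {max_chars} characters")
    if _SECRET_PATTERN.search(output):
        violations.append("output appears to contain a secret")
    return violations
```

Returning a list of violations rather than a boolean lets the caller report every failed rule at once.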
evaluate compare
Runs two agent configurations through the same benchmark and produces a side-by-side comparison. Useful for A/B testing configuration changes.
evaluate compare ./agent-v1/AGENTS.md ./agent-v2/AGENTS.md \
--models gpt-4o azure_ai/claude-opus-4-6 \
--intent "Answer user questions about Python" \
--output comparison.json
Accepts the same options as benchmark (--models, --rubric, --evaluator-model, --runs, --no-judge, --account-name, --subscription, --resource-group, --output, --format).
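Conceptually, a side-by-side comparison reduces to per-metric deltas between two result summaries. The sketch below assumes each configuration has been summarized as a flat mapping of metric name to number; the metric keys and the compare_summaries helper are hypothetical and do not reflect the structure of the tool's JSON report.

```python
def compare_summaries(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate-minus-baseline deltas for metrics present in both summaries."""
    return {
        key: candidate[key] - baseline[key]
        for key in baseline
        if key in candidate
    }


# Example with made-up numbers: v2 scores higher, costs less, and responds faster.
v1 = {"quality": 7.5, "cost_usd": 0.5, "latency_s": 4.0}
v2 = {"quality": 8.0, "cost_usd": 0.25, "latency_s": 3.0}
deltas = compare_summaries(v1, v2)
```

A positive delta is an improvement for quality but a regression for cost and latency, so any ranking built on these values needs to account for the direction of each metric.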
Evaluation Rubrics
The benchmark command scores agent outputs using an LLM-as-judge against a rubric. Three rubrics are built in:
default -- General-purpose evaluation:
| Dimension | Weight | Description |
|---|---|---|
| relevance | 2.0 | How well the output addresses the user's intent |
| accuracy | 2.0 | Factual and technical correctness |
| completeness | 1.5 | Whether all aspects of the request are covered |
| clarity | 1.0 | Structure and readability |
| efficiency | 1.0 | Appropriate tool use, no unnecessary steps |
code_generation -- For code tasks:
| Dimension | Weight | Description |
|---|---|---|
| correctness | 2.5 | Functional correctness and edge case handling |
| completeness | 2.0 | All requested functionality implemented |
| code_quality | 1.5 | Naming, structure, DRY principles |
| security | 1.5 | Avoids common vulnerabilities |
| documentation | 1.0 | Comments and docstrings |
qa -- For question-answering:
| Dimension | Weight | Description |
|---|---|---|
| accuracy | 2.5 | Factual correctness |
| relevance | 2.0 | Directly addresses the question |
| depth | 1.5 | Thoroughness of explanation |
| source_usage | 1.0 | Use of tools and references |
| conciseness | 1.0 | Avoids unnecessary verbosity |
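The weights above suggest how per-dimension scores could be folded into one number. The exact formula is not stated here, so the sketch below assumes a weighted mean (each score multiplied by its weight, divided by the sum of weights); composite_score is a hypothetical helper, and the weights are copied from the default rubric table.

```python
DEFAULT_WEIGHTS = {
    "relevance": 2.0,
    "accuracy": 2.0,
    "completeness": 1.5,
    "clarity": 1.0,
    "efficiency": 1.0,
}


def composite_score(
    scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS
) -> float:
    """Weighted mean of per-dimension scores; every weighted dimension must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight
```

Dividing by the total weight keeps the composite on the same scale as the individual scores, so rubrics with different weight totals remain comparable.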
Custom Rubrics
Create a YAML file and pass it via --rubric path/to/rubric.yaml:
name: my_rubric
description: Custom rubric for my use case
dimensions:
  - name: accuracy
    description: Factual correctness of the response
    weight: 2.0
  - name: tone
    description: Professional and helpful tone
    weight: 1.5
  - name: actionability
    description: Provides clear next steps
    weight: 1.0
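Before passing a custom rubric to the CLI, it can help to sanity-check its shape. The sketch below assumes the file has already been parsed into a dict (for example with PyYAML's yaml.safe_load) and checks the fields used in the example above. Which fields the tool actually requires is an assumption, and validate_rubric is a hypothetical helper rather than the loader in rubrics.py.

```python
def validate_rubric(data: dict) -> list[tuple[str, float]]:
    """Check a parsed rubric mapping and return (dimension name, weight) pairs."""
    for field in ("name", "dimensions"):
        if field not in data:
            raise ValueError(f"rubric is missing required field: {field}")
    dimensions: list[tuple[str, float]] = []
    for dim in data["dimensions"]:
        weight = float(dim["weight"])
        if weight <= 0:
            raise ValueError(f"dimension {dim['name']!r} must have a positive weight")
        dimensions.append((dim["name"], weight))
    if not dimensions:
        raise ValueError("rubric must define at least one dimension")
    return dimensions
```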
Architecture
sage_evaluator/
├── cli/
│ └── main.py # Click CLI with 5 commands
├── validation/
│ └── validator.py # 3-level config validation
├── benchmark/
│ ├── engine.py # Orchestrates the benchmark pipeline
│ ├── runner.py # InstrumentedProvider for metrics capture
│ └── collector.py # Multi-run metric aggregation
├── evaluation/
│ ├── judge.py # LLM-as-judge scoring
│ └── rubrics.py # Built-in and YAML rubric loading
├── discovery/
│ ├── azure_models.py # Azure AI Foundry model discovery
│ └── pricing.py # 3-tier pricing lookup
├── suggestion/
│ ├── analyzer.py # Prompt and config analysis
│ ├── tool_generator.py # @tool function code generation
│ └── guardrail_generator.py # Guardrail function generation
├── reporting/
│ ├── terminal.py # Rich terminal output
│ └── json_export.py # JSON report serialization
├── models.py # All Pydantic data models
└── exceptions.py # Exception hierarchy
Key design decisions:
- Async-first -- Benchmark execution, model discovery, and suggestion analysis use asyncio for parallel operations.
- Instrumentation via wrapping -- InstrumentedProvider wraps the Sage LiteLLMProvider to capture metrics without modifying the agent runtime.
- Deterministic guardrails -- Generated guardrail functions are pure validation logic with no LLM calls at runtime.
- Strongly typed -- All data flows through Pydantic models for validation and serialization.
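The wrapping and async-first ideas can be illustrated together. The sketch below is not the real InstrumentedProvider: the Provider protocol, its complete method, TimingProvider, and EchoProvider are hypothetical stand-ins showing how a wrapper can record latency around an awaited call while leaving the wrapped object untouched, and how asyncio.gather runs several calls concurrently.

```python
import asyncio
import time
from typing import Protocol


class Provider(Protocol):
    """Hypothetical minimal provider interface for this sketch."""

    async def complete(self, prompt: str) -> str: ...


class TimingProvider:
    """Wraps any provider and records per-call latency without changing it."""

    def __init__(self, inner: Provider) -> None:
        self._inner = inner
        self.latencies: list[float] = []

    async def complete(self, prompt: str) -> str:
        start = time.perf_counter()
        try:
            return await self._inner.complete(prompt)
        finally:
            # Recorded even if the inner call raises.
            self.latencies.append(time.perf_counter() - start)


class EchoProvider:
    """Stand-in provider used only to exercise the wrapper."""

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a network call would
        return prompt.upper()


async def run_all(provider: Provider, prompts: list[str]) -> list[str]:
    """Issue all prompts concurrently; results come back in input order."""
    return list(await asyncio.gather(*(provider.complete(p) for p in prompts)))
```

Because the wrapper exposes the same method as the object it wraps, callers need no changes to gain instrumentation, for example asyncio.run(run_all(TimingProvider(EchoProvider()), ["a", "b"])).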
Development
# Run the full quality pipeline (lint + format + type-check + tests)
make test
# Run only tests
make test-only
# Individual checks
make lint # ruff check with auto-fix
make format # ruff format
make type-check # mypy
# Clean build artifacts
make clean
Running Tests
Tests use pytest with pytest-asyncio for async test support and pytest-mock for mocking:
# All tests
uv run pytest -v tests
# Specific test module
uv run pytest -v tests/test_validation/
uv run pytest -v tests/test_benchmark/
# Single test
uv run pytest -v tests/test_cli/test_validate.py -k "test_validate_strict"
CI/CD
The project uses GitHub Actions for continuous integration and release management:
- CI (ci.yml) -- Runs make test (lint, format, type-check, pytest) on pull requests to main.
- Release (release.yml) -- On push to main or dev, runs Python Semantic Release to version, tag, and publish to PyPI. Pushes to dev produce release candidates.