Evaluation framework for Strands

These details have not been verified by PyPI

Project description

Strands Evals SDK

A comprehensive evaluation framework for AI agents and LLM applications.

Documentation ◆ Samples ◆ Python SDK ◆ Typescript SDK ◆ Tools ◆ Evaluations

Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.

Feature Overview

Multiple Evaluation Types: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation
Multimodal Evaluation: MLLM-as-a-Judge evaluators for image-to-text tasks with built-in rubrics
Dynamic Simulators: Multi-turn conversation simulation with realistic user behavior, goal-oriented interactions, and LLM-powered tool simulation with shared state
LLM-as-a-Judge: Built-in evaluators using language models for sophisticated assessment with structured scoring
Trace-based Evaluation: Analyze agent behavior through OpenTelemetry execution traces
Automated Experiment Generation: Generate comprehensive test suites from context descriptions
Custom Evaluators: Extensible framework for domain-specific evaluation logic
Experiment Management: Save, load, and version your evaluation experiments with JSON serialization
Built-in Scoring Tools: Helper functions for exact, in-order, and any-order trajectory matching
Failure Detection & Root Cause Analysis: Automatically detect failures in agent sessions and diagnose root causes with actionable fix recommendations
Chaos Testing: Deterministic fault injection via Strands plugin hooks — simulate tool timeouts, network errors, and response corruption to evaluate agent resilience
Red Team Evaluation: Adversarial safety testing with built-in attack strategies (Crescendo, GOAT, PAIR, BadLikertJudge, SequentialBreak); see src/strands_evals/experimental/redteam/README.md

Quick Start

# Install Strands Evals SDK
pip install strands-agents-evals

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Create test cases
test_cases = [
    Case[str, str](
        name="knowledge-1",
        input="What is the capital of France?",
        expected_output="The capital of France is Paris.",
        metadata={"category": "knowledge"}
    )
]

# Create evaluators with custom rubric
evaluators = [
    OutputEvaluator(
        rubric="""
        Evaluate based on:
        1. Accuracy - Is the information correct?
        2. Completeness - Does it fully answer the question?
        3. Clarity - Is it easy to understand?
        
        Score 1.0 if all criteria are met excellently.
        Score 0.5 if some criteria are partially met.
        Score 0.0 if the response is inadequate.
        """
    )
]

# Create experiment and run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)

def get_response(case: Case) -> str:
    agent = Agent(callback_handler=None)
    return str(agent(case.input))

# Run evaluations
report = experiment.run_evaluations(get_response)
report.run_display()

Command-Line Interface

Installing strands-agents-evals also installs the strands-evals console script, a thin wrapper over the Python API for CI and one-off use. It has five subcommands:

Command	Purpose
`strands-evals run`	Execute an Experiment against an `--agent` factory or `--task` callable, or run a single ad-hoc case via `--input` + `--evaluator`/`--expected-output`/`--rubric`.
`strands-evals validate`	Schema-check a serialized Experiment JSON file (CI gate before `run`).
`strands-evals report`	Render an existing `EvaluationReport` JSON via Rich, or dump it as JSON.
`strands-evals diagnose`	Run `detect_failures`, `analyze_root_cause`, or the full `diagnose_session` pipeline on a Session JSON file.
`strands-evals generate`	Synthesize an Experiment via `ExperimentGenerator` from a free-form `--context` or an existing `--experiment` file.

Common flows:

# Schema-check, then run an experiment file against an agent factory
strands-evals validate experiments/customer_service.json
strands-evals run experiments/customer_service.json \
  --agent my_pkg.agents:build_agent \
  --display

# One-off ad-hoc run with no experiment file
strands-evals run \
  --input "What is the capital of France?" \
  --expected-output "Paris" \
  --agent my_pkg.agents:build_agent

# Diagnose a failing session captured to JSON
strands-evals diagnose session.json --confidence medium

# Generate a starter experiment from a tools description
strands-evals generate --context "$(cat tools.txt)" --num-cases 10 \
  --evaluator TrajectoryEvaluator -o experiments/generated.json

Run any subcommand with --help for the full flag set (custom evaluators via MODULE:CLASS, trace attributes, --fail-on exit-code rules, output formats, etc.).

Installation

Ensure you have Python 3.10+ installed, then:

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

# Install in development mode
pip install -e .

# Install with test dependencies
pip install -e ".[test]"

# Install with both test and dev dependencies
pip install -e ".[test,dev]"

Features at a Glance

Output Evaluation with Custom Rubrics

Evaluate agent responses using LLM-as-a-judge with flexible scoring criteria:

from strands_evals.evaluators import OutputEvaluator

evaluator = OutputEvaluator(
    rubric="Score 1.0 for accurate, complete responses. Score 0.5 for partial answers. Score 0.0 for incorrect or unhelpful responses.",
    include_inputs=True,  # Include context in evaluation
    model="global.anthropic.claude-sonnet-4-6"  # Custom judge model
)

Trajectory Evaluation with Built-in Scoring

Analyze agent tool usage and action sequences with helper scoring functions:

from strands_evals.evaluators import TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor
from strands_tools import calculator

def get_response_with_tools(case: Case) -> dict:
    agent = Agent(tools=[calculator])
    response = agent(case.input)
    
    # Extract trajectory efficiently to prevent context overflow
    trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(agent.messages)
    
    # Update evaluator with tool descriptions
    evaluator.update_trajectory_description(
        tools_use_extractor.extract_tools_description(agent, is_short=True)
    )
    
    return {"output": str(response), "trajectory": trajectory}

# Evaluator includes built-in scoring tools: exact_match_scorer, in_order_match_scorer, any_order_match_scorer
evaluator = TrajectoryEvaluator(
    rubric="Score 1.0 if correct tools used in proper sequence. Use scoring tools to verify trajectory matches."
)

Trace-based Helpfulness Evaluation

Evaluate agent helpfulness using OpenTelemetry traces with seven-level scoring:

from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper

# Setup telemetry for trace capture
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()

def user_task_function(case: Case) -> dict:
    telemetry.memory_exporter.clear()
    
    agent = Agent(
        trace_attributes={"session.id": case.session_id},
        callback_handler=None
    )
    response = agent(case.input)
    
    # Map spans to session for evaluation
    spans = telemetry.memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(spans, session_id=case.session_id)
    
    return {"output": str(response), "trajectory": session}

# Seven-level scoring: Not helpful (0.0) to Above and beyond (1.0)
evaluators = [HelpfulnessEvaluator()]
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)

# Run evaluations
report = experiment.run_evaluations(user_task_function)
report.run_display()

Multi-turn Conversation Simulation

Simulate realistic user interactions with dynamic, goal-oriented conversations using ActorSimulator:

from strands import Agent
from strands_evals import Case, Experiment, ActorSimulator
from strands_evals.evaluators import HelpfulnessEvaluator, GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter

def task_function(case: Case) -> dict:
    # Create simulator to drive conversation
    simulator = ActorSimulator.from_case_for_user_simulator(
        case=case,
        max_turns=10
    )

    # Create agent to evaluate
    agent = Agent(
        trace_attributes={
            "gen_ai.conversation.id": case.session_id,
            "session.id": case.session_id
        },
        callback_handler=None
    )

    # Run multi-turn conversation
    all_spans = []
    user_message = case.input

    while simulator.has_next():
        memory_exporter.clear()
        agent_response = agent(user_message)
        turn_spans = list(memory_exporter.get_finished_spans())
        all_spans.extend(turn_spans)

        user_result = simulator.act(str(agent_response))
        user_message = str(user_result.structured_output.message)

    # Map to session for evaluation
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(all_spans, session_id=case.session_id)

    return {"output": str(agent_response), "trajectory": session}

# Use evaluators to assess simulated conversations
evaluators = [
    HelpfulnessEvaluator(),
    GoalSuccessRateEvaluator()
]

experiment = Experiment(cases=test_cases, evaluators=evaluators)
report = experiment.run_evaluations(task_function)

Key Benefits:

Dynamic Interactions: Simulator adapts responses based on agent behavior
Goal-Oriented Testing: Verify agents can complete user objectives through dialogue
Realistic Conversations: Generate authentic multi-turn interaction patterns
No Predefined Scripts: Test agents without hardcoded conversation paths
Comprehensive Evaluation: Combine with trace-based evaluators for full assessment

Tool Simulation

Simulate tool behavior with LLM-powered responses for controlled agent evaluation using ToolSimulator. Register tools with a decorator, define output schemas, and optionally share state across related tools — the simulator replaces real execution with realistic, schema-validated responses:

from typing import Any
from enum import Enum
from pydantic import BaseModel, Field
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation.tool_simulator import ToolSimulator

tool_simulator = ToolSimulator()

# Define output schema
class HVACMode(str, Enum):
    HEAT = "heat"
    COOL = "cool"
    AUTO = "auto"
    OFF = "off"

class HVACResponse(BaseModel):
    temperature: float = Field(..., description="Target temperature in Fahrenheit")
    mode: HVACMode = Field(..., description="HVAC mode")
    status: str = Field(default="success", description="Operation status")

# Register tool — the function body is never called; the LLM generates responses
@tool_simulator.tool(
    share_state_id="room_environment",
    initial_state_description="Room: 68°F, humidity 45%, HVAC off",
    output_schema=HVACResponse,
)
def hvac_controller(temperature: float, mode: str) -> dict[str, Any]:
    """Control heating/cooling system that affects room temperature and humidity."""
    pass

def task_function(case: Case) -> dict:
    hvac_tool = tool_simulator.get_tool("hvac_controller")
    agent = Agent(tools=[hvac_tool], callback_handler=None)
    response = agent(case.input)
    return {"output": str(response)}

cases = [Case(name="heat_control", input="Turn on the heat to 72 degrees")]
experiment = Experiment(cases=cases, evaluators=[GoalSuccessRateEvaluator()])
report = experiment.run_evaluations(task_function)

Key Benefits:

No Real Infrastructure: Test tool-using agents without live APIs, databases, or services
Schema-Validated Responses: Pydantic output schemas ensure structured, consistent tool responses
Shared State: Related tools (e.g., sensor + controller) share state via share_state_id for coherent behavior
Stateful Context: Call history and initial state are passed to the LLM for consistent multi-call sequences
Drop-in Replacement: Simulated tools plug directly into Strands Agent via get_tool()

Automated Experiment Generation

Generate comprehensive test suites automatically from context descriptions:

from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import TrajectoryEvaluator

# Define available tools and context
tool_context = """
Available tools:
- calculator(expression: str) -> float: Evaluate mathematical expressions
- web_search(query: str) -> str: Search the web for information
- file_read(path: str) -> str: Read file contents
"""

# Generate experiment with multiple test cases
generator = ExperimentGenerator[str, str](str, str)
experiment = await generator.from_context_async(
    context=tool_context,
    num_cases=10,
    evaluator=TrajectoryEvaluator,
    task_description="Math and research assistant with tool usage",
    num_topics=3  # Distribute cases across multiple topics
)

# Save generated experiment
experiment.to_file("generated_experiment", "json")

Failure Detection & Diagnosis

Detect failures in agent sessions and analyze root causes to get actionable fix recommendations:

from strands_evals import Case, DiagnosisConfig, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.types.detector import ConfidenceLevel, DiagnosisTrigger

# Enable diagnosis on failing cases — runs failure detection then root cause analysis
experiment = Experiment(
    cases=test_cases,
    evaluators=[HelpfulnessEvaluator()],
    diagnosis_config=DiagnosisConfig(
        trigger=DiagnosisTrigger.ON_FAILURE,           # ON_FAILURE or ALWAYS
        confidence_threshold=ConfidenceLevel.MEDIUM,   # LOW, MEDIUM, or HIGH
    ),
)

report = experiment.run_evaluations(task_function)

# Display results with recommendations
report.display(include_recommendations=True)

You can also use the detectors standalone on any Session object (strands_evals.types.trace.Session):

from strands_evals.detectors import detect_failures, analyze_root_cause, diagnose_session
from strands_evals.types.detector import ConfidenceLevel

# Full pipeline: detect failures → root cause analysis
result = diagnose_session(session, confidence_threshold=ConfidenceLevel.MEDIUM)

for rc in result.root_causes:
    print(f"{rc.fix_type}: {rc.fix_recommendation}")

# Or run each step independently
failures = detect_failures(session, confidence_threshold=ConfidenceLevel.MEDIUM)
if failures.failures:
    rca = analyze_root_cause(session, failures=failures.failures)

# analyze_root_cause calls detect_failures automatically if failures is not provided
rca = analyze_root_cause(session)

Chaos Testing

Inject deterministic tool failures and response corruption to evaluate how an agent handles adverse conditions. Effects fire through Strands' native plugin hooks, and the user's task body stays chaos-free:

from strands import Agent
from strands_evals import Case
from strands_evals.chaos import (
    ChaosCase, ChaosExperiment, ChaosPlugin,
    Timeout, NetworkError, TruncateFields,
)

base_cases = [Case(name="flight_search", input="Find flights to Tokyo")]
effect_maps = {
    "search_timeout":  {"tool_effects": {"search_tool":   [Timeout()]}},
    "db_truncate":     {"tool_effects": {"database_tool": [TruncateFields(max_length=20)]}},
}
chaos_cases = ChaosCase.expand(base_cases, effect_maps, include_no_effect_baseline=True)

def task(case):
    agent = Agent(tools=[search_tool, database_tool], plugins=[ChaosPlugin()])
    return {"output": str(agent(case.input))}

report = ChaosExperiment(cases=chaos_cases, evaluators=[...]).run_evaluations(task=task)

Available effects: Timeout, NetworkError, ExecutionError, ValidationError (cancel the tool call); TruncateFields, RemoveFields, CorruptValues (corrupt the response). Pair with chaos-aware evaluators in strands_evals.evaluators.chaos (FailureCommunicationEvaluator, PartialCompletionEvaluator, RecoveryStrategyEvaluator). For the full authoring guide see SKILL.md.

Custom Evaluators with Structured Output

Create domain-specific evaluation logic with standardized output format:

from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput

class PolicyComplianceEvaluator(Evaluator[str, str]):
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> EvaluationOutput:
        # Custom evaluation logic
        response = evaluation_case.actual_output
        
        # Check for policy violations
        violations = self._check_policy_violations(response)
        
        if not violations:
            return EvaluationOutput(
                score=1.0,
                test_pass=True,
                reason="Response complies with all policies",
                label="compliant"
            )
        else:
            return EvaluationOutput(
                score=0.0,
                test_pass=False,
                reason=f"Policy violations: {', '.join(violations)}",
                label="non_compliant"
            )
    
    def _check_policy_violations(self, response: str) -> list[str]:
        # Implementation details...
        return []

Tool Usage and Parameter Evaluation

Evaluate specific aspects of tool usage with specialized evaluators:

from strands_evals.evaluators import ToolSelectionAccuracyEvaluator, ToolParameterAccuracyEvaluator

# Evaluate if correct tools were selected
tool_selection_evaluator = ToolSelectionAccuracyEvaluator(
    rubric="Score 1.0 if optimal tools selected, 0.5 if suboptimal but functional, 0.0 if wrong tools"
)

# Evaluate if tool parameters were correct
tool_parameter_evaluator = ToolParameterAccuracyEvaluator(
    rubric="Score based on parameter accuracy and appropriateness for the task"
)

Multimodal Evaluation (MLLM-as-a-Judge)

Evaluate multimodal agent responses involving images and text using MLLM-as-a-Judge:

from strands_evals import Case, Experiment
from strands_evals.evaluators import (
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
    MultimodalInstructionFollowingEvaluator,
    MultimodalOverallQualityEvaluator,
)
from strands_evals.types import ImageData, MultimodalInput

# Create image data from various sources
image = ImageData(source="path/to/image.png")
# Also supports: base64 strings, data URLs, HTTP URLs, PIL Images, raw bytes

# Define test cases with multimodal input
test_cases = [
    Case[MultimodalInput, str](
        name="image-description-1",
        input=MultimodalInput(
            media=image,
            instruction="Describe the contents of this image in detail.",
        ),
    )
]

# Use built-in evaluators with default rubrics (reference-free by default; reference-based rubric auto-selected when expected_output is provided)
evaluators = [
    MultimodalCorrectnessEvaluator(),       # Factual accuracy and completeness
    MultimodalFaithfulnessEvaluator(),      # Grounded in media without hallucinations
    MultimodalInstructionFollowingEvaluator(),  # Addresses query requirements
    MultimodalOverallQualityEvaluator(),    # Visual accuracy, adherence, completeness, coherence
]

# Run evaluation
experiment = Experiment[MultimodalInput, str](cases=test_cases, evaluators=evaluators)

def get_response(case: Case) -> str:
    agent = Agent(callback_handler=None)
    image = case.input.media
    return str(agent([
        {"image": {"format": image.format or "png", "source": {"bytes": image.to_bytes()}}},
        {"text": case.input.instruction}
    ]))

report = experiment.run_evaluations(get_response)
report.run_display()

Available Evaluators

Output-Based Evaluators

These evaluators work directly with inputs and outputs without requiring OpenTelemetry traces:

OutputEvaluator: Flexible LLM-based evaluation with custom rubrics
TrajectoryEvaluator: Action sequence evaluation with built-in scoring tools (supports both list-based trajectories and Session traces via extractors)
InteractionsEvaluator: Multi-agent interaction and handoff evaluation
Custom Evaluators: Extensible base class for domain-specific logic

Trace-Based Evaluators

These evaluators require OpenTelemetry traces (Session objects) to analyze agent behavior:

Tool-Level Evaluators

Evaluate individual tool calls within a conversation:

ToolSelectionAccuracyEvaluator: Evaluates appropriateness of tool choices at specific points
ToolParameterAccuracyEvaluator: Evaluates correctness of tool parameters based on context

Trace-Level Evaluators

Evaluate the most recent turn in a conversation:

HelpfulnessEvaluator: Seven-level helpfulness assessment from user perspective
FaithfulnessEvaluator: Evaluates if responses are grounded in conversation history
CoherenceEvaluator: Assesses logical cohesion and reasoning quality with five-level scoring
ConcisenessEvaluator: Evaluates response brevity with three-level scoring
ResponseRelevanceEvaluator: Evaluates relevance of responses to user questions
HarmfulnessEvaluator: Binary evaluation for harmful content detection
RefusalEvaluator: Binary evaluation for whether responses refuse to address the prompt
StereotypingEvaluator: Binary evaluation for bias or stereotypical content against groups
InstructionFollowingEvaluator: Binary evaluation for whether explicit instructions are followed

Session-Level Evaluators

Evaluate entire conversation sessions:

GoalSuccessRateEvaluator: Measures if user goals were achieved across the full conversation

Multimodal Evaluators

These evaluators assess multimodal (image-to-text) responses using MLLM-as-a-Judge, with built-in rubrics that support both reference-based and reference-free evaluation:

MultimodalCorrectnessEvaluator: Evaluates factual accuracy and completeness against media content
MultimodalFaithfulnessEvaluator: Evaluates whether responses are grounded in media without hallucinations
MultimodalInstructionFollowingEvaluator: Evaluates whether responses address query requirements and constraints
MultimodalOverallQualityEvaluator: Holistic evaluation across visual accuracy, instruction adherence, completeness, and coherence
MultimodalOutputEvaluator: Base class for custom multimodal evaluators with configurable rubrics

Detectors

Analyze agent sessions to identify failures and their root causes:

detect_failures: Scans a Session for failures with configurable confidence thresholds
analyze_root_cause: Identifies root causes for detected failures with fix recommendations
diagnose_session: End-to-end pipeline combining failure detection and root cause analysis

Experiment Management and Serialization

Save, load, and version experiments for reproducibility:

# Save experiment with metadata
experiment.to_file("customer_service_eval", "json")

# Load experiment from file
loaded_experiment = Experiment.from_file("./experiment_files/customer_service_eval.json", "json")

# Experiment files include:
# - Test cases with metadata
# - Evaluator configuration
# - Expected outputs and trajectories
# - Versioning information

Evaluation Metrics and Analysis

Track comprehensive metrics across multiple dimensions:

# Built-in metrics to consider:
metrics = {
    "accuracy": "Factual correctness of responses",
    "task_completion": "Whether agent completed the task",
    "tool_selection": "Appropriateness of tool choices", 
    "response_time": "Agent response latency",
    "hallucination_rate": "Frequency of fabricated information",
    "token_usage": "Efficiency of token consumption",
    "user_satisfaction": "Subjective helpfulness ratings"
}

# Generate analysis report
report = experiment.run_evaluations(task_function)
report.run_display()  # Interactive display with metrics breakdown

# Multi-evaluator runs return a single flattened report; each row is tagged with its evaluator
# via cases[i]["evaluator"], so you can filter or group without an extra flatten step.
report.display(include_recommendations=True)

Best Practices

Evaluation Strategy

Diversify Test Cases: Cover knowledge, reasoning, tool usage, conversation, edge cases, and safety scenarios
Use Statistical Baselines: Run multiple evaluations to account for LLM non-determinism
Combine Multiple Evaluators: Use output, trajectory, and helpfulness evaluators together
Regular Evaluation Cadence: Implement consistent evaluation schedules for continuous improvement

Performance Optimization

Use Extractors: Always use tools_use_extractor functions to prevent context overflow
Update Descriptions Dynamically: Call update_trajectory_description() with tool descriptions
Choose Appropriate Judge Models: Use stronger models for complex evaluations
Batch Evaluations: Process multiple test cases efficiently

Experiment Design

Write Clear Rubrics: Include explicit scoring criteria and examples
Include Expected Trajectories: Define exact sequences for trajectory evaluation
Use Appropriate Matching: Choose between exact, in-order, or any-order matching
Version Control: Track agent configurations alongside evaluation results

Documentation

For detailed guidance & examples, explore our documentation:

Contributing ❤️

We welcome contributions! See our Contributing Guide for details on:

Development setup
Contributing via Pull Requests
Code of Conduct
Reporting of security issues

Stay in touch with the team

Come meet the Strands team and other users on Discord

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Security

See CONTRIBUTING for more information.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

1.0.1

Jun 25, 2026

1.0.0

Jun 16, 2026

0.3.0

Jun 12, 2026

0.2.1

May 29, 2026

0.2.0

May 14, 2026

0.1.17

May 8, 2026

0.1.16

Apr 30, 2026

0.1.15

Apr 17, 2026

0.1.14

Apr 8, 2026

0.1.13

Mar 31, 2026

0.1.12

Mar 26, 2026

0.1.11

Mar 19, 2026

0.1.10

Mar 11, 2026

0.1.9

Mar 4, 2026

0.1.8

Feb 25, 2026

0.1.7

Feb 19, 2026

0.1.6

Feb 11, 2026

0.1.5

Feb 5, 2026

0.1.4

Jan 29, 2026

0.1.3

Jan 21, 2026

0.1.2

Jan 13, 2026

0.1.1

Dec 15, 2025

0.1.0

Dec 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

strands_agents_evals-1.0.1.tar.gz (451.2 kB view details)

Uploaded Jun 25, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

strands_agents_evals-1.0.1-py3-none-any.whl (352.7 kB view details)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file strands_agents_evals-1.0.1.tar.gz.

File metadata

Download URL: strands_agents_evals-1.0.1.tar.gz
Upload date: Jun 25, 2026
Size: 451.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for strands_agents_evals-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`bcf11f1ee5f2d3b5703733595b20d5b3d87843a127057b4226d04cd75d670261`
MD5	`bfd2f072624aaea49270fcdad129e09f`
BLAKE2b-256	`37cdbf660bb933447b2cef740fac53e20afee03bb49c43e5d496c1e85925ebe7`

See more details on using hashes here.

Provenance

The following attestation bundles were made for strands_agents_evals-1.0.1.tar.gz:

Publisher: pypi-publish-on-release.yml on strands-agents/evals

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: strands_agents_evals-1.0.1.tar.gz
- Subject digest: bcf11f1ee5f2d3b5703733595b20d5b3d87843a127057b4226d04cd75d670261
- Sigstore transparency entry: 1958979514
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: strands-agents/evals@b99dc07dce9561cadc0546167b47911cb7d29415
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/strands-agents
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish-on-release.yml@b99dc07dce9561cadc0546167b47911cb7d29415
- Trigger Event: release

File details

Details for the file strands_agents_evals-1.0.1-py3-none-any.whl.

File metadata

Download URL: strands_agents_evals-1.0.1-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 352.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for strands_agents_evals-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b2cee1ca232c3120831b08d68965441a48aeed9fceba18b832bce7cbaa4b981a`
MD5	`e95414f785d417f9c654bf725a1d6fdc`
BLAKE2b-256	`98301ab14993e2f809d3b79d20b6befc133048ee8758a3a5701f1164a369d3f8`

See more details on using hashes here.

Provenance

The following attestation bundles were made for strands_agents_evals-1.0.1-py3-none-any.whl:

Publisher: pypi-publish-on-release.yml on strands-agents/evals

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: strands_agents_evals-1.0.1-py3-none-any.whl
- Subject digest: b2cee1ca232c3120831b08d68965441a48aeed9fceba18b832bce7cbaa4b981a
- Sigstore transparency entry: 1958979797
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: strands-agents/evals@b99dc07dce9561cadc0546167b47911cb7d29415
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/strands-agents
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish-on-release.yml@b99dc07dce9561cadc0546167b47911cb7d29415
- Trigger Event: release

strands-agents-evals 1.0.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Strands Evals SDK

A comprehensive evaluation framework for AI agents and LLM applications.

Feature Overview

Quick Start

Command-Line Interface

Installation

Features at a Glance

Output Evaluation with Custom Rubrics

Trajectory Evaluation with Built-in Scoring

Trace-based Helpfulness Evaluation

Multi-turn Conversation Simulation

Tool Simulation

Automated Experiment Generation

Failure Detection & Diagnosis

Chaos Testing

Custom Evaluators with Structured Output

Tool Usage and Parameter Evaluation

Multimodal Evaluation (MLLM-as-a-Judge)

Available Evaluators

Output-Based Evaluators

Trace-Based Evaluators

Tool-Level Evaluators

Trace-Level Evaluators

Session-Level Evaluators

Multimodal Evaluators

Detectors

Experiment Management and Serialization

Evaluation Metrics and Analysis

Best Practices

Evaluation Strategy

Performance Optimization

Experiment Design

Documentation

Contributing ❤️

Stay in touch with the team

License

Security

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance