Microsoft Azure Evaluation Library for Python

Azure AI Evaluation client library for Python

Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Application outputs are measured quantitatively with mathematical metrics and with AI-assisted quality and safety metrics. Metrics are defined as evaluators. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.

Use Azure AI Evaluation SDK to:

  • Evaluate existing data from generative AI applications
  • Evaluate generative AI applications
  • Evaluate by generating mathematical, AI-assisted quality and safety metrics

The Azure AI SDK provides the following to evaluate generative AI applications:

  • Evaluators - Generate scores individually or when used together with the evaluate API.
  • Evaluate API - Python API to evaluate a dataset or an application using built-in or custom evaluators.

Source code | Package (PyPI) | API reference documentation | Product documentation | Samples

Getting started

Prerequisites

Install the package

Install the Azure AI Evaluation SDK for Python with pip:

pip install azure-ai-evaluation

Key concepts

Evaluators

Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

Built-in evaluators

Built-in evaluators are out-of-the-box evaluators provided by Microsoft:

Category | Evaluator class
Performance and quality (AI-assisted) | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, RetrievalEvaluator
Performance and quality (NLP) | F1ScoreEvaluator, RougeScoreEvaluator, GleuScoreEvaluator, BleuScoreEvaluator, MeteorScoreEvaluator
Risk and safety (AI-assisted) | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, IndirectAttackEvaluator, ProtectedMaterialEvaluator
Composite | QAEvaluator, ContentSafetyEvaluator

For more in-depth information on each evaluator definition and how it's calculated, see Evaluation and monitoring metrics for generative AI.

import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# There are two ways to provide Azure AI Project.
# Option #1 : Using Azure AI Project Details 
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

from azure.identity import DefaultAzureCredential

violence_evaluator = ViolenceEvaluator(credential=DefaultAzureCredential(), azure_ai_project=azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)

# Option #2: Using Azure AI Project URL
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

violence_evaluator = ViolenceEvaluator(credential=DefaultAzureCredential(), azure_ai_project=azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)

Custom evaluators

Built-in evaluators are a good starting point for evaluating your application's generations. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        # True if any blocked word appears in the response
        score = any(word in response for word in self._blocklist)
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(response="The capital of Japan is Tokyo.")

Evaluate API

The package provides an evaluate API that runs multiple evaluators together to evaluate the responses of a generative AI application.

Evaluate existing dataset

from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${outputs.response}"
            }
        }
    },
    # Optionally provide your AI Foundry project information to track your evaluation results in your Azure AI Foundry project
    azure_ai_project=azure_ai_project,
    # Optionally provide an output path to dump a JSON file with the metric summary, row-level data, and the AI Foundry URL
    output_path="./evaluation_results.json"
)

For more details refer to Evaluate on test dataset using evaluate()
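
A minimal sketch of what data.jsonl might contain, assuming the column names queries, ground_truth, and response implied by the snippet above. The rows and the helper functions write_jsonl and read_jsonl are made up for illustration and are not part of the SDK.

```python
import json
import os
import tempfile

# Hypothetical rows; column names follow the column mapping in the snippet above.
rows = [
    {
        "queries": "What is the capital of Japan?",
        "ground_truth": "The capital of Japan is Tokyo.",
        "response": "Tokyo is the capital of Japan.",
    },
    {
        "queries": "What is the capital of France?",
        "ground_truth": "The capital of France is Paris.",
        "response": "Paris.",
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Written to a temporary directory here; use your own path for the data argument.
data_path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
write_jsonl(data_path, rows)
loaded = read_jsonl(data_path)
```

Reading the file back is a cheap way to confirm that every line is valid JSON before starting an evaluation run.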

Evaluate generative AI application

from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_evaluator
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            } 
        }
    }
)

The code snippet above refers to the askwiki application in this sample.

For more details refer to Evaluate on a target
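
The askwiki application lives in a separate sample. As a rough sketch of the shape a target might take, the hypothetical stand-in below assumes that a target is a callable that receives a data column as a keyword argument and returns a dict whose keys are what the ${outputs.*} references read. The name fake_askwiki, its queries parameter, and the canned answers are all invented for illustration.

```python
# Hypothetical stand-in for the askwiki application; the canned answers are
# invented. Assumption: the parameter name matches a column in data.jsonl and
# the returned keys are what "${outputs.response}" / "${outputs.context}" read.
CANNED_ANSWERS = {
    "What is the capital of Japan?": "The capital of Japan is Tokyo.",
}


def fake_askwiki(queries: str) -> dict:
    response = CANNED_ANSWERS.get(queries, "I don't know.")
    return {"response": response, "context": "canned lookup table"}


outputs = fake_askwiki(queries="What is the capital of Japan?")
```

A stub like this can be handy for checking column mappings locally before pointing the evaluation at a real application.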

Simulator

Simulators allow users to generate synthetic data using their application. The simulator expects the user to have a callback method that invokes their AI application. The integration between your AI application and the simulator happens at the callback method. Here's what a sample callback looks like:

from typing import Any, Dict, List, Optional

async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # Get the last message from the user
    latest_message = messages_list[-1]
    query = latest_message["content"]
    # Call your endpoint or AI application here
    # response should be a string
    response = call_to_your_application(query, messages_list, context)
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": "",
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
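
Before wiring the callback into a simulator, it can help to run it once by hand. The sketch below repeats the callback with a stub in place of call_to_your_application (the stub and its echo reply are assumptions for illustration) and drives it with asyncio.run:

```python
import asyncio
from typing import Any, Dict, List, Optional


def call_to_your_application(query: str, messages_list: List[Dict], context: Any) -> str:
    # Hypothetical stub: a real application would call a model or an endpoint here.
    return f"You asked: {query}"


async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    response = call_to_your_application(query, messages_list, context)
    messages["messages"].append({"content": response, "role": "assistant", "context": ""})
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


# Drive the callback once with a single user turn.
conversation = {"messages": [{"role": "user", "content": "How do I simulate data?"}]}
result = asyncio.run(callback(conversation))
```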

The simulator initialization and invocation look like this:

import asyncio
import os

from azure.ai.evaluation.simulator import Simulator

model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}
custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=[
        [
            "What should I know about the public gardens in the US?",
        ],
        [
            "How do I simulate data against LLMs",
        ],
    ],
    max_conversation_turns=2,
))
with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())

Adversarial Simulator

import asyncio

from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential

# There are two ways to provide Azure AI Project.
# Option #1 : Using Azure AI Project 
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>"
}

# Option #2 : Using Azure AI Project Url 
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=3,
        target=callback
    )
)

print(outputs.to_eval_qr_json_lines())

For more details about the simulator, visit the following links:

Examples

In the following section you will find examples of:

More examples can be found here.

Troubleshooting

General

Please refer to troubleshooting for common issues.

Logging

This library uses the standard logging library for logging. Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO level.

Detailed DEBUG level logging, including request/response bodies and unredacted headers, can be enabled on a client with the logging_enable argument.

See full SDK logging documentation with examples here.
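
As a sketch of the standard-library side of this, the snippet below attaches a stdout handler to the azure logger. That logger name is the parent commonly used by Azure SDK client libraries; treat it as an assumption for this package.

```python
import logging
import sys

# Assumption: the library logs under the "azure" logger hierarchy, as Azure SDK
# client libraries generally do.
logger = logging.getLogger("azure")
logger.setLevel(logging.DEBUG)

# Send log records to stdout with a timestamp, logger name, and level.
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```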

Next steps

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Release History

1.15.3 (2026-02-25)

Bugs Fixed

  • Fixed Foundry red team double-evaluation that caused ~3x scan latency by removing redundant evaluation_processor.evaluate() call.
  • Fixed lost red team attack details where baseline overwrites wiped evaluation results.
  • Fixed incorrect per_testing_criteria that showed attack strategies alongside risk categories.
  • Fixed metadata leak of internal scorer fields (attack_success, attack_strategy, score) into results.json.
  • Improved error logging with run_id, display_name, and full stack traces for red team scan failures.

1.15.2 (2026-02-23)

Bugs Fixed

  • Fixed batch evaluation to report per-line errors only on the rows that failed.

1.15.1 (2026-02-19)

Bugs Fixed

  • Red Team Agent Scenario Integration: integrated PyRIT's FoundryScenario for attack orchestration with Azure-specific scoring and result processing.
  • Fixed total tokens calculation errors in evaluation results.
  • Fixed red team SDK run status updates not firing consistently, preventing runs from being stuck as "Running" in the UI.

1.15.0 (2026-02-03)

Bugs Fixed

  • Prevent recursive stdout/stderr forwarding when NodeLogManager is nested, avoiding RecursionError in concurrent evaluation runs.

Other Changes

  • The [redteam] extra now requires pyrit==0.11.0, which depends on pillow>=12.1.0. This conflicts with promptflow-devkit (pillow<=11.3.0). Use separate virtual environments if you need both packages.

1.14.0 (2026-01-05)

Bugs Fixed

  • Updated CodeVulnerability and UngroundedAttributes evaluators for RedTeam to use the binary true/false scoring pattern so their results align with service responses.
  • Fixed handling of nested fields for AOAI graders when using files as datasource
  • Fixed GroundednessEvaluator with query not honoring is_reasoning_model (and credential) when reloading the query prompty, which could cause max_tokens to be sent to reasoning models. #44385

1.13.7 (2025-11-14)

Bugs Fixed

  • Fixed NoneType error when generating usage summary in evaluation results.
  • Fixed results for f1_score.

1.13.6 (2025-11-12)

Bugs Fixed

  • Added detection and retry handling for network errors wrapped in generic exceptions with "Error sending prompt with conversation ID" message
  • Fixed results for ungrounded_attributes
  • score_mode grader improvements
  • Fixed Red Team to ensure hate/unfairness evaluation rows populate when OneDP sync evaluators report results under the hate_unfairness metric name.

1.13.5 (2025-11-10)

Bugs Fixed

  • TaskAdherenceEvaluator: treat tool definitions as optional so evaluations with only query/response inputs no longer raise “Either 'conversation' or individual inputs must be provided.”

1.13.4 (2025-11-10)

Bugs Fixed

  • Handle input data for evaluation result when evaluators.

1.13.3 (2025-11-08)

Other Changes

  • Added scenario property to red team evaluation request to align scores with red team concepts of attack success.

1.13.2 (2025-11-07)

Bugs Fixed

  • Added App Insights redaction for agent safety run telemetry so adversarial prompts are not stored in collected logs.

1.13.1 (2025-11-05)

Features Added

  • Improved RedTeam coverage across risk sub-categories to ensure comprehensive security testing
  • Made RedTeam's AttackStrategy.Tense seed prompts dynamic to allow use of this strategy with additional risk categories
  • Refactors error handling and result semantics in the RedTeam evaluation system to improve clarity and align with Attack Success Rate (ASR) conventions (passed=False means attack success)

Bugs Fixed

  • Fixed RedTeam evaluation error related to context handling for context-dependent risk categories
  • Fixed RedTeam prompt application for model targets during Indirect Jailbreak XPIA (Cross-Platform Indirect Attack)

1.13.0 (2025-10-30)

Features Added

  • Updated IndirectAttack risk category for RedTeam to IndirectJailbreak to better reflect its purpose. This change allows users to apply cross-domain prompt injection (XPIA) attack strategies across all risk categories, enabling more comprehensive security testing of AI systems against indirect prompt injection attacks during red teaming.

  • Added TaskAdherence, SensitiveDataLeakage, and ProhibitedActions as cloud-only agent safety risk categories for red teaming.

  • Updated all evaluators' output to be of the following schema:

    • gpt_{evaluator_name}, {evaluator_name}: float score,
    • {evaluator_name}_result: pass/fail based on threshold,
    • {evaluator_name}_reason, {evaluator_name}_threshold
    • {evaluator_name}_prompt_tokens, {evaluator_name}_completion_tokens, {evaluator_name}_total_tokens, {evaluator_name}_finish_reason
    • {evaluator_name}_model: model used for evaluation
    • {evaluator_name}_sample_input, {evaluator_name}_sample_output: input and output used for evaluation

    This change standardizes the output format across all evaluators and follows OTel convention.
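
To make the naming pattern concrete, here is a small sketch that expands the schema for a given evaluator name. The helper expected_output_keys is hypothetical and not part of the SDK; it only mirrors the key list above.

```python
def expected_output_keys(evaluator_name: str) -> list:
    """Return the result keys listed in the 1.13.0 schema for one evaluator."""
    suffixes = [
        "", "_result", "_reason", "_threshold",
        "_prompt_tokens", "_completion_tokens", "_total_tokens", "_finish_reason",
        "_model", "_sample_input", "_sample_output",
    ]
    keys = [f"gpt_{evaluator_name}"]
    keys += [f"{evaluator_name}{suffix}" for suffix in suffixes]
    return keys


relevance_keys = expected_output_keys("relevance")
```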

Bugs Fixed

  • image_tag parameter in AzureOpenAIPythonGrader is now optional.

1.11.2 (2025-10-09)

Bugs Fixed

  • **kwargs in an evaluator signature receives input columns that are not otherwise named in the evaluator's signature

1.12.0 (2025-10-02)

Features Added

  • AOAI Graders now accept a "credential" parameter that can be used for authentication with an AzureOpenAIModelConfiguration
  • Added is_reasoning_model parameter support to CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, GroundednessEvaluator, RetrievalEvaluator, and RelevanceEvaluator to enable reasoning model configuration for o1/o3 models.

Bugs Fixed

  • Support for multi-level nesting in OpenAI grader (experimental)

1.11.1 (2025-09-19)

Bugs Fixed

  • Pinning duckdb version to 1.3.2 for redteam extra to fix error TypeError: unhashable type: '_duckdb.typing.DuckDBPyType'

1.11.0 (2025-09-03)

Features Added

  • Added support for user-supplied tags in the evaluate function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a tags parameter.
  • Added support for user-supplied TokenCredentials with LLM based evaluators.
  • Enhanced GroundednessEvaluator to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from file_search tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the file_search tool is used.
  • Added language parameter to RedTeam class for multilingual red team scanning support. The parameter accepts values from SupportedLanguages enum including English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, and Simplified Chinese, enabling red team attacks to be generated and conducted in multiple languages.
  • Added support for IndirectAttack and UngroundedAttributes risk categories in RedTeam scanning. These new risk categories expand red team capabilities to detect cross-platform indirect attacks and evaluate ungrounded inferences about human attributes including emotional state and protected class information.

Bugs Fixed

  • Fixed issue where evaluation results were not properly aligned with input data, leading to incorrect metrics being reported.

Other Changes

  • Deprecating AdversarialSimulator in favor of the AI Red Teaming Agent. AdversarialSimulator will be removed in the next minor release.
  • Moved retry configuration constants (MAX_RETRY_ATTEMPTS, MAX_RETRY_WAIT_SECONDS, MIN_RETRY_WAIT_SECONDS) from RedTeam class to new RetryManager class for better code organization and configurability.

1.10.0 (2025-07-31)

Breaking Changes

  • Added evaluate_query parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to True, both query and response will be evaluated; when set to False (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including ContentSafetyEvaluator, ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, ProtectedMaterialEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator, UngroundedAttributesEvaluator, GroundednessProEvaluator, and EciEvaluator. Existing code that relies on queries being evaluated will need to explicitly set evaluate_query=True to maintain the previous behavior.

Features Added

  • Added support for Azure OpenAI Python grader via AzureOpenAIPythonGrader class, which serves as a wrapper around Azure Open AI Python grader configurations. This new grader object can be supplied to the main evaluate method as if it were a normal callable evaluator.
  • Added attack_success_thresholds parameter to RedTeam class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e. higher threshold means higher tolerance for harmful responses).
  • Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.

Bugs Fixed

  • Fixed red team scan output_path issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's output_path is reserved for final aggregated results.
  • Significant improvements to TaskAdherence evaluator. New version has less variance, is much faster and consumes fewer tokens.
  • Significant improvements to Relevance evaluator. New version has more concrete rubrics and has less variance, is much faster and consumes fewer tokens.

Other Changes

  • The default engine for evaluation was changed from promptflow (PFClient) to an in-SDK batch client (RunSubmitterClient)
    • Note: We've temporarily kept an escape hatch to fall back to the legacy promptflow implementation by setting _use_pf_client=True when invoking evaluate(). This is due to be removed in a future release.

1.9.0 (2025-07-02)

Features Added

  • Added support for Azure Open AI evaluation via AzureOpenAIScoreModelGrader class, which serves as a wrapper around Azure Open AI score model configurations. This new grader object can be supplied to the main evaluate method as if it were a normal callable evaluator.
  • Added new experimental risk categories ProtectedMaterial and CodeVulnerability for redteam agent scan.

Bugs Fixed

  • Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.

  • Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].

  • Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. #41415

  • Added a new enum ADVERSARIAL_QA_DOCUMENTS which moves all the "file_content" type prompts away from ADVERSARIAL_QA to the new enum

  • AzureOpenAIScoreModelGrader evaluator now supports pass_threshold parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.
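
The MeteorScoreEvaluator fix above comes down to comparing the raw decimal score against the threshold rather than a truncated integer. A sketch of the difference, using the 0.9375 score from the note and an assumed threshold of 0.5:

```python
score = 0.9375
threshold = 0.5  # assumed value, for illustration only

# Buggy pattern: int() truncates 0.9375 to 0 before the comparison.
buggy_result = "pass" if int(score) >= threshold else "fail"

# Fixed pattern: compare the decimal score directly.
fixed_result = "pass" if score >= threshold else "fail"
```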

1.8.0 (2025-05-29)

Features Added

  • Introduces AttackStrategy.MultiTurn and AttackStrategy.Crescendo to RedTeam. These strategies attack the target of a RedTeam scan over the course of multi-turn conversations.

Bugs Fixed

  • AdversarialSimulator in ADVERSARIAL_CONVERSATION mode was broken. It is now fixed.

1.7.0 (2025-05-12)

Bugs Fixed

  • azure-ai-evaluation failed with module not found #40992

1.6.0 (2025-05-07)

Features Added

  • New <evaluator>.binary_aggregate field added to evaluation result metrics. This field contains the aggregated binary evaluation results for each evaluator, providing a summary of the evaluation outcomes.
  • Added support for Azure Open AI evaluation via 4 new 'grader' classes, which serve as wrappers around Azure Open AI grader configurations. These new grader objects can be supplied to the main evaluate method as if they were normal callable evaluators. The new classes are:
    • AzureOpenAIGrader (general class for experienced users)
    • AzureOpenAILabelGrader
    • AzureOpenAIStringCheckGrader
    • AzureOpenAITextSimilarityGrader

Breaking Changes

  • In the experimental RedTeam's scan method, the data_only param has been replaced with skip_evals and if you do not want data to be uploaded, use the skip_upload flag.

Bugs Fixed

  • Fixed error in evaluate where data fields could not contain numeric characters. Previously, a data file with schema:
    "query1": "some query", "response": "some response"
    
    throws error when passed into evaluator_config as {"evaluator_name": {"column_mapping": {"query": "${data.query1}", "response": "${data.response}"}},}. Now, users may import data containing fields with numeric characters.
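
To illustrate how such a mapping resolves against a row, here is a minimal resolver sketch. It is not the SDK's implementation; resolve_mapping and its regular expression are assumptions that simply accept letters, digits, and underscores in field names.

```python
import re

# Matches "${data.<field>}" where <field> may contain digits, e.g. query1.
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(column_mapping: dict, row: dict) -> dict:
    """Build evaluator keyword arguments from one data row."""
    resolved = {}
    for arg_name, reference in column_mapping.items():
        match = _DATA_REF.match(reference)
        if match is None:
            raise ValueError(f"Unsupported reference: {reference}")
        resolved[arg_name] = row[match.group(1)]
    return resolved


row = {"query1": "some query", "response": "some response"}
mapping = {"query": "${data.query1}", "response": "${data.response}"}
kwargs = resolve_mapping(mapping, row)
```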

1.5.0 (2025-04-04)

Features Added

  • New RedTeam agent functionality to assess the safety and resilience of AI systems against adversarial prompt attacks

1.4.0 (2025-03-27)

Features Added

  • Enhanced binary evaluation results with customizable thresholds

    • Added threshold support for QA and ContentSafety evaluators
    • Evaluation results now include both the score and threshold values
    • Configurable threshold parameter allows custom binary classification boundaries
    • Default thresholds provided for backward compatibility
    • Quality evaluators use "higher is better" scoring (score ≥ threshold is positive)
    • Content safety evaluators use "lower is better" scoring (score ≤ threshold is positive)
  • New Built-in evaluator called CodeVulnerabilityEvaluator is added.

    • It provides capabilities to identify the following code vulnerabilities.
      • path-injection
      • sql-injection
      • code-injection
      • stack-trace-exposure
      • incomplete-url-substring-sanitization
      • flask-debug
      • clear-text-logging-sensitive-data
      • incomplete-hostname-regexp
      • server-side-unvalidated-url-redirection
      • weak-cryptographic-algorithm
      • full-ssrf
      • bind-socket-all-network-interfaces
      • client-side-unvalidated-url-redirection
      • likely-bugs
      • reflected-xss
      • clear-text-storage-sensitive-data
      • tarslip
      • hardcoded-credentials
      • insecure-randomness
    • It also supports multiple coding languages, such as Python, Java, C++, C#, Go, JavaScript, and SQL.
  • New Built-in evaluator called UngroundedAttributesEvaluator is added.

    • It evaluates ungrounded inference of human attributes for a given query, response, and context for a single-turn evaluation only, where query represents the user query and response represents the AI system response given the provided context.

    • Ungrounded Attributes first checks whether a response is ungrounded, and then checks whether it contains information about the protected class or emotional state of a person.

    • It identifies the following attributes:

      • emotional_state
      • protected_class
      • groundedness
  • New Built-in evaluators for Agent Evaluation (Preview)

    • IntentResolutionEvaluator - Evaluates the intent resolution of an agent's response to a user query.
    • ResponseCompletenessEvaluator - Evaluates the response completeness of an agent's response to a user query.
    • TaskAdherenceEvaluator - Evaluates the task adherence of an agent's response to a user query.
    • ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.
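
The threshold rules listed under "Enhanced binary evaluation results" above can be sketched as follows. The function name binary_result and the example scores and thresholds are illustrative assumptions, not SDK defaults.

```python
def binary_result(score: float, threshold: float, higher_is_better: bool) -> str:
    """Quality metrics pass at score >= threshold; safety metrics at score <= threshold."""
    if higher_is_better:
        return "pass" if score >= threshold else "fail"
    return "pass" if score <= threshold else "fail"


# A quality score of 4 against a threshold of 3 passes ("higher is better").
quality = binary_result(score=4, threshold=3, higher_is_better=True)

# A content safety score of 5 against a threshold of 3 fails ("lower is better").
safety = binary_result(score=5, threshold=3, higher_is_better=False)
```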

Bugs Fixed

  • Fixed error in GroundednessProEvaluator when handling non-numeric values like "n/a" returned from the service.
  • Uploading local evaluation results from evaluate with the same run name will no longer result in each online run sharing (and bashing) result files.

1.3.0 (2025-02-28)

Breaking Changes

  • Multimodal-specific evaluators ContentSafetyMultimodalEvaluator, ViolenceMultimodalEvaluator, SexualMultimodalEvaluator, SelfHarmMultimodalEvaluator, HateUnfairnessMultimodalEvaluator and ProtectedMaterialMultimodalEvaluator have been removed. Please use ContentSafetyEvaluator, ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator and ProtectedMaterialEvaluator instead.
  • Metric name in ProtectedMaterialEvaluator's output is changed from protected_material.fictional_characters_label to protected_material.fictional_characters_defect_rate. It's now consistent with other evaluator's metric names (ending with _defect_rate).

1.2.0 (2025-01-27)

Features Added

  • CSV files are now supported as data file inputs with evaluate() API. The CSV file should have a header row with column names that match the data and target fields in the evaluate() method and the filename should be passed as the data parameter. Column name 'Conversation' in CSV file is not fully supported yet.

Breaking Changes

  • ViolenceMultimodalEvaluator, SexualMultimodalEvaluator, SelfHarmMultimodalEvaluator, HateUnfairnessMultimodalEvaluator and ProtectedMaterialMultimodalEvaluator will be removed in next release.

Bugs Fixed

  • Removed [remote] extra. This is no longer needed when tracking results in Azure AI Studio.
  • Fixed AttributeError: 'NoneType' object has no attribute 'get' while running simulator with 1000+ results
  • Fixed the non adversarial simulator to run in task-free mode
  • Content safety evaluators (violence, self harm, sexual, hate/unfairness) return the maximum result as the main score when aggregating per-turn evaluations from a conversation into an overall evaluation score. Other conversation-capable evaluators still default to a mean for aggregation.
  • Fixed bug in non adversarial simulator sample where tasks was undefined
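
The per-turn aggregation rule described above (maximum for content safety evaluators, mean for other conversation-capable evaluators) can be sketched like this. The helper aggregate_turns and the sample scores are hypothetical.

```python
from statistics import mean


def aggregate_turns(per_turn_scores, is_content_safety: bool) -> float:
    """Max for content safety evaluators, mean for other conversation evaluators."""
    if is_content_safety:
        return max(per_turn_scores)
    return mean(per_turn_scores)


# One severe turn dominates the overall content safety score.
violence_score = aggregate_turns([0, 1, 5], is_content_safety=True)

# Other evaluators average across turns.
coherence_score = aggregate_turns([3, 4, 5], is_content_safety=False)
```

Taking the maximum means a single harmful turn is not diluted by benign turns elsewhere in the conversation.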

Other Changes

  • Changed minimum required python version to use this package from 3.8 to 3.9
  • Stop dependency on the local promptflow service. No promptflow service will automatically start when running evaluation.
  • Evaluators internally allow for custom aggregation. However, this causes serialization failures if evaluated while the environment variable AI_EVALS_BATCH_USE_ASYNC is set to false.

1.1.0 (2024-12-12)

Features Added

  • Added image support in ContentSafetyEvaluator, ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator and ProtectedMaterialEvaluator. Provide image URLs or base64 encoded images in conversation input for image evaluation. See below for an example:
evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)
conversation = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}
print("Calling Content Safety Evaluator for multi-modal")
score = evaluator(conversation=conversation)
  • Please switch to generic evaluators for image evaluations as mentioned above. ContentSafetyMultimodalEvaluator, ContentSafetyMultimodalEvaluatorBase, ViolenceMultimodalEvaluator, SexualMultimodalEvaluator, SelfHarmMultimodalEvaluator, HateUnfairnessMultimodalEvaluator and ProtectedMaterialMultimodalEvaluator will be deprecated in the next release.

Bugs Fixed

  • Removed [remote] extra. This is no longer needed when tracking results in Azure AI Foundry portal.
  • Fixed AttributeError: 'NoneType' object has no attribute 'get' while running simulator with 1000+ results

1.0.1 (2024-11-15)

Bugs Fixed

  • Removing azure-ai-inference as dependency.
  • Fixed AttributeError: 'NoneType' object has no attribute 'get' while running simulator with 1000+ results

1.0.0 (2024-11-13)

Breaking Changes

  • The parallel parameter has been removed from composite evaluators: QAEvaluator, ContentSafetyChatEvaluator, and ContentSafetyMultimodalEvaluator. To control evaluator parallelism, you can now use the _parallel keyword argument, though please note that this private parameter may change in the future.
  • Parameters query_response_generating_prompty_kwargs and user_simulator_prompty_kwargs have been renamed to query_response_generating_prompty_options and user_simulator_prompty_options in the Simulator's call method.

Bugs Fixed

  • Fixed an issue where the output_path parameter in the evaluate API did not support relative path.
  • Outputs of adversarial simulators are now of type JsonLineList, and the helper function to_eval_qr_json_lines now outputs context from both user and assistant turns, along with category if it exists in the conversation.
  • Fixed an issue where the API token expired during long-running simulations, causing a "Forbidden" error. Users can now set the environment variable AZURE_TOKEN_REFRESH_INTERVAL to refresh the token more frequently, which prevents expiration and keeps the simulation running.
  • Fixed an issue with the ContentSafetyEvaluator that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the '_parallel' boolean keyword argument during class initialization.
  • Fixed the evaluate function not producing aggregated metrics if any of the values to be aggregated were None, NaN, or otherwise difficult to process. Such values are now ignored entirely, so the aggregated metric of [1, 2, 3, NaN] would be 2, not 1.5.
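
The aggregation rule in the last fix can be illustrated with a small standalone sketch. This is not the SDK's implementation; the helper name mean_ignoring_invalid is hypothetical and only mirrors the described behavior of skipping None and NaN values before averaging.

```python
import math
from typing import Iterable, Optional


def mean_ignoring_invalid(values: Iterable[Optional[float]]) -> float:
    """Average the usable numbers, skipping None and NaN entries (illustrative only)."""
    usable = [
        v
        for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool) and not math.isnan(v)
    ]
    # With nothing usable left, report NaN rather than raising ZeroDivisionError.
    return sum(usable) / len(usable) if usable else math.nan


# Matches the changelog example: the NaN is dropped, so the mean is taken over [1, 2, 3].
print(mean_ignoring_invalid([1, 2, 3, math.nan]))  # 2.0
```

Counting the NaN as a zero would instead give 6 / 4 = 1.5, which is the result the changelog entry says is no longer produced.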

Other Changes

  • Refined error messages for service-based evaluators and simulators.
  • Tracing has been disabled due to Cosmos DB initialization issue.
  • Introduced environment variable AI_EVALS_DISABLE_EXPERIMENTAL_WARNING to disable the warning message for experimental features.
  • Changed the randomization pattern for AdversarialSimulator such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the AdversarialSimulator outputs. Previously, for 200 max_simulation_results a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, users will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
  • For the DirectAttackSimulator, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass randomize_order=True when you call the DirectAttackSimulator, for example:
adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
outputs = asyncio.run(
    adversarial_simulator(
        scenario=scenario,
        target=callback,
        randomize_order=True
    )
)
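
The near-equal category split described for AdversarialSimulator in the list above can be pictured with a round-robin allocation. The sketch below is an assumption about one way to get such a split, not the simulator's actual logic; allocate_categories is a hypothetical helper, and the category labels are copied from the changelog entry.

```python
from collections import Counter
from itertools import cycle, islice

# Category labels as named in the changelog entry; the allocation itself is illustrative.
CATEGORIES = ["Hate + Unfairness", "Self-Harm", "Violence", "Sex"]


def allocate_categories(max_simulation_results: int) -> Counter:
    """Hand out result slots round-robin so category counts differ by at most one."""
    return Counter(islice(cycle(CATEGORIES), max_simulation_results))


# 200 slots across 4 categories gives 50 each, as in the changelog example.
print(allocate_categories(200))
```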

1.0.0b5 (2024-10-28)

Features Added

  • Added GroundednessProEvaluator, which is a service-based evaluator for determining response groundedness.
  • Groundedness detection in Non Adversarial Simulator via query/context pairs
import asyncio
import json
import importlib.resources as pkg_resources

from azure.ai.evaluation.simulator import Simulator

package = "azure.ai.evaluation.simulator._data_sources"
resource_name = "grounding.json"
custom_simulator = Simulator(model_config=model_config)
conversation_turns = []
with pkg_resources.path(package, resource_name) as grounding_file:
    with open(grounding_file, "r") as file:
        data = json.load(file)
for item in data:
    conversation_turns.append([item])
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=conversation_turns,
    max_conversation_turns=1,
))
  • Added evaluators for multimodal use cases

Breaking Changes

  • Renamed environment variable PF_EVALS_BATCH_USE_ASYNC to AI_EVALS_BATCH_USE_ASYNC.
  • RetrievalEvaluator now requires a context input in addition to query in single-turn evaluation.
  • RelevanceEvaluator no longer takes context as an input. It now only takes query and response in single-turn evaluation.
  • FluencyEvaluator no longer takes query as an input. It now only takes response in single-turn evaluation.
  • The AdversarialScenario enum no longer includes ADVERSARIAL_INDIRECT_JAILBREAK. Invoke IndirectJailbreak or XPIA through IndirectAttackSimulator instead.
  • Outputs of Simulator and AdversarialSimulator previously had to_eval_qa_json_lines and now have to_eval_qr_json_lines. Where to_eval_qa_json_lines had:
{"question": <user_message>, "answer": <assistant_message>}

to_eval_qr_json_lines now has:

{"query": <user_message>, "response": <assistant_message>}
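
For JSON Lines files saved with the older question/answer keys, a small converter can bring them in line with the query/response shape shown above. This is a minimal sketch that assumes each line is a flat JSON object; qa_to_qr_json_lines is a hypothetical helper and not part of the SDK.

```python
import json


def qa_to_qr_json_lines(qa_json_lines: str) -> str:
    """Rename "question"/"answer" keys to "query"/"response" on every JSON line."""
    renamed = {"question": "query", "answer": "response"}
    converted = []
    for line in qa_json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        # Keys outside the rename table (for example "context") pass through unchanged.
        converted.append(json.dumps({renamed.get(k, k): v for k, v in record.items()}))
    return "\n".join(converted)


old = '{"question": "What is Azure?", "answer": "A cloud platform."}'
print(qa_to_qr_json_lines(old))
```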

Bugs Fixed

  • Non adversarial simulator works with gpt-4o models using the json_schema response format
  • Fixed an issue where the evaluate API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when venv folder and target function file are in the same directory.
  • Fixed evaluate API failure when trace.destination is set to none
  • Non adversarial simulator now accepts context from the callback

Other Changes

  • Improved error messages for the evaluate API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.

  • GroundednessEvaluator now supports query as an optional input in single-turn evaluation. If query is provided, a different prompt template will be used for the evaluation.

  • To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the gpt_ prefix. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.

    • CoherenceEvaluator
    • RelevanceEvaluator
    • FluencyEvaluator
    • GroundednessEvaluator
    • SimilarityEvaluator
    • RetrievalEvaluator
  • The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.

    Evaluator New max_token for Generation
    CoherenceEvaluator 800
    RelevanceEvaluator 800
    FluencyEvaluator 800
    GroundednessEvaluator 800
    RetrievalEvaluator 1600
  • Improved the error message for storage access permission issues to provide clearer guidance for users.
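
Code that reads evaluator results can prefer the new un-prefixed key and fall back to the gpt_ key while both are present. The sketch below assumes a result dictionary keyed by metric name (for example "coherence" and "gpt_coherence"); the sample values and the read_score helper are hypothetical.

```python
from typing import Any, Mapping, Optional, Tuple


def read_score(result: Mapping[str, Any], metric: str) -> Tuple[Optional[Any], Optional[str]]:
    """Return (score, reason), preferring the un-prefixed key over the gpt_ key."""
    score = result.get(metric, result.get(f"gpt_{metric}"))
    # The reasoning key follows the "<metric_name>_reason" pattern from the changelog.
    reason = result.get(f"{metric}_reason")
    return score, reason


# Hypothetical evaluator output, used only to exercise the helper.
sample = {
    "gpt_coherence": 4.0,
    "coherence": 4.0,
    "coherence_reason": "The answer flows logically.",
}
print(read_score(sample, "coherence"))
```

Reading through one helper like this keeps callers working if the prefixed key is removed in a later release.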

1.0.0b4 (2024-10-16)

Breaking Changes

  • Removed numpy dependency. All NaN values returned by the SDK have been changed from numpy.nan to math.nan.
  • credential is now required to be passed in for all content safety evaluators and ProtectedMaterialEvaluator. DefaultAzureCredential will no longer be chosen if a credential is not passed.
  • Changed package extra name from "pf-azure" to "remote".
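
Because NaN never compares equal to anything, including itself, code that inspects returned scores should test with math.isnan rather than ==. The helper below is a hypothetical sketch of such a check using only the standard library.

```python
import math


def is_missing_score(value: object) -> bool:
    """True when a score is None or a float NaN (illustrative helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


score = math.nan
print(score == math.nan)        # False: an equality test cannot detect NaN
print(is_missing_score(score))  # True
```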

Bugs Fixed

  • Adversarial Conversation simulations would fail with Forbidden. Added logic to re-fetch the token in the exponential retry logic used to retrieve the RAI Service response.
  • Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.

Other Changes

  • Enhance the error message to provide clearer instruction when required packages for the remote tracking feature are missing.
  • Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.

1.0.0b3 (2024-10-01)

Features Added

  • Added type field to AzureOpenAIModelConfiguration and OpenAIModelConfiguration
  • The following evaluators now support conversation as an alternative input to their usual single-turn inputs:
    • ViolenceEvaluator
    • SexualEvaluator
    • SelfHarmEvaluator
    • HateUnfairnessEvaluator
    • ProtectedMaterialEvaluator
    • IndirectAttackEvaluator
    • CoherenceEvaluator
    • RelevanceEvaluator
    • FluencyEvaluator
    • GroundednessEvaluator
  • Surfaced RetrievalScoreEvaluator, formerly an internal part of ChatEvaluator, as a standalone conversation-only evaluator.

Breaking Changes

  • Removed ContentSafetyChatEvaluator and ChatEvaluator
  • The evaluator_config parameter of evaluate now maps an evaluator name to an EvaluatorConfig dictionary, which is a TypedDict. The column_mapping between data or target and evaluator field names should now be specified inside this new dictionary:

Before:

evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    },
    ...
)

After:

evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
            }
        }
    },
    ...
)
  • Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables the usage of simulator with Entra ID based auth. Before:
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RESOURCE_GROUP"),
    "project_name": os.environ.get("PROJECT_NAME"),
}
sim = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

After:

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
}
sim = Simulator(model_config=model_config)

If api_key is not included in the model_config, the prompty runtime in promptflow-core will pick up DefaultAzureCredential.
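
For the evaluator_config change earlier in this list, existing flat mappings can be converted mechanically. The sketch below is a hypothetical migration helper, not part of the SDK; it assumes the old configuration holds only field-to-column mappings for each evaluator.

```python
from typing import Any, Dict


def wrap_column_mappings(old_config: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Nest each evaluator's flat field mapping under a "column_mapping" key."""
    new_config: Dict[str, Dict[str, Any]] = {}
    for evaluator_name, settings in old_config.items():
        if "column_mapping" in settings:
            new_config[evaluator_name] = dict(settings)  # already in the new shape
        else:
            new_config[evaluator_name] = {"column_mapping": dict(settings)}
    return new_config


old_config = {
    "hate_unfairness": {
        "query": "${data.question}",
        "response": "${data.answer}",
    }
}
print(wrap_column_mappings(old_config))
```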

Bugs Fixed

  • Fixed issue where Entra ID authentication was not working with AzureOpenAIModelConfiguration

1.0.0b2 (2024-09-24)

Breaking Changes

  • data and evaluators are now required keywords in evaluate.

1.0.0b1 (2024-09-20)

Breaking Changes

  • The synthetic namespace has been renamed to simulator, and sub-namespaces under this module have been removed
  • The evaluate and evaluators namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace azure.ai.evaluation
  • The parameter name project_scope in content safety evaluators has been renamed to azure_ai_project for consistency with the evaluate API and simulators.
  • Model configuration classes are now of type TypedDict and are exposed in the azure.ai.evaluation module instead of coming from promptflow.core.
  • Updated the parameter names for question and answer in built-in evaluators to more generic terms: query and response.

Features Added

  • First preview
  • This package is a port of promptflow-evals. New features will be added only to this package moving forward.
  • Added a TypedDict for AzureAIProject that allows for better intellisense and type checking when passing in project information
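
To show what a TypedDict buys for project information, the sketch below defines a local stand-in with the three keys used in this document's azure_ai_project examples. AzureAIProjectSketch is a hypothetical name; the SDK's own AzureAIProject type may differ in detail.

```python
import os
from typing import TypedDict


class AzureAIProjectSketch(TypedDict):
    """Local stand-in mirroring the keys used in this document's examples."""

    subscription_id: str
    resource_group_name: str
    project_name: str


# A type checker can now flag a misspelled or missing key in this literal.
azure_ai_project: AzureAIProjectSketch = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID", ""),
    "resource_group_name": os.environ.get("RESOURCE_GROUP", ""),
    "project_name": os.environ.get("PROJECT_NAME", ""),
}
print(sorted(azure_ai_project))
```

At runtime the value is still a plain dict, so it can be passed wherever a dictionary of project settings is accepted.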

