Microsoft Azure Evaluation Library for Python
Project description
Azure AI Evaluation client library for Python
Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Application outputs are quantitatively measured with mathematical metrics as well as AI-assisted quality and safety metrics. Metrics are defined as evaluators. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.
Use Azure AI Evaluation SDK to:
- Evaluate existing data from generative AI applications
- Evaluate generative AI applications
- Evaluate by generating mathematical, AI-assisted quality and safety metrics
The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:
- Evaluators - Generate scores individually or when used together with the evaluate API.
- Evaluate API - Python API to evaluate a dataset or application using built-in or custom evaluators.
Source code | Package (PyPI) | API reference documentation | Product documentation | Samples
Getting started
Prerequisites
- Python 3.9 or later is required to use this package.
- [Optional] You must have an Azure AI Foundry project or an Azure OpenAI resource to use AI-assisted evaluators
Install the package
Install the Azure AI Evaluation SDK for Python with pip:
pip install azure-ai-evaluation
Key concepts
Evaluators
Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.
Built-in evaluators
Built-in evaluators are out-of-the-box evaluators provided by Microsoft:
Category | Evaluator class
---|---
Performance and quality (AI-assisted) | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, RetrievalEvaluator
Performance and quality (NLP) | F1ScoreEvaluator, RougeScoreEvaluator, GleuScoreEvaluator, BleuScoreEvaluator, MeteorScoreEvaluator
Risk and safety (AI-assisted) | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, IndirectAttackEvaluator, ProtectedMaterialEvaluator
Composite | QAEvaluator, ContentSafetyEvaluator
For more in-depth information on each evaluator definition and how it's calculated, see Evaluation and monitoring metrics for generative AI.
import os

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# AI assisted safety evaluator (requires an Azure AI project and a credential)
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

violence_evaluator = ViolenceEvaluator(credential=DefaultAzureCredential(), azure_ai_project=azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
Custom evaluators
Built-in evaluators are great out of the box to start evaluating your application's generations. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class-based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any(word in response for word in self._blocklist)
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(response="The capital of Japan is Tokyo.")
Evaluate API
The package provides an evaluate API which can be used to run multiple evaluators together to evaluate generative AI application responses.
Evaluate existing dataset
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl",  # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${outputs.response}"
            }
        }
    },
    # Optionally provide your AI Foundry project information to track your evaluation results in your Azure AI Foundry project
    azure_ai_project=azure_ai_project,
    # Optionally provide an output path to dump a JSON of the metric summary, row-level data and the AI Foundry URL
    output_path="./evaluation_results.json"
)
For more details, refer to Evaluate on test dataset using evaluate().
Evaluate generative AI application
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_evaluator
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
The above code snippet refers to the askwiki application in this sample.
For more details, refer to Evaluate on a target.
Simulator
Simulators allow users to generate synthetic data using their application. The simulator expects the user to provide a callback method that invokes their AI application; the integration between your AI application and the simulator happens at this callback method. Here's what a sample callback looks like:
from typing import Any, Dict, List, Optional

async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # Get the last message from the user
    latest_message = messages_list[-1]
    query = latest_message["content"]
    # Call your endpoint or AI application here
    # response should be a string
    response = call_to_your_application(query, messages_list, context)
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": "",
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
The simulator initialization and invocation look like this:
import asyncio
import os

from azure.ai.evaluation.simulator import Simulator

model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}

custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=[
        [
            "What should I know about the public gardens in the US?",
        ],
        [
            "How do I simulate data against LLMs",
        ],
    ],
    max_conversation_turns=2,
))

with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
Adversarial Simulator
import asyncio

from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=3,
        target=callback
    )
)

print(outputs.to_eval_qr_json_lines())
For more details about the simulator, refer to the product documentation and samples.
Examples
In the following section you will find examples of:
- Evaluate an application
- Evaluate different models
- Custom Evaluators
- Adversarial Simulation
- Simulate with conversation starter
More examples can be found here.
Troubleshooting
General
Please refer to troubleshooting for common issues.
Logging
This library uses the standard logging library for logging. Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO level.
Detailed DEBUG level logging, including request/response bodies and unredacted headers, can be enabled on a client with the logging_enable argument.
See full SDK logging documentation with examples here.
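For illustration, here is a minimal sketch of enabling verbose logging; the azure logger name and the logging_enable keyword follow the general Azure SDK for Python convention, and the client named in the comment is hypothetical:

import logging
import sys

# Route the Azure SDK's "azure" logger to stdout at DEBUG level.
logger = logging.getLogger("azure")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(stream=sys.stdout))

# Verbose logging can also be requested per client or per operation that
# accepts it, e.g. some_client = SomeClient(..., logging_enable=True)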
Next steps
- View our samples.
- View our documentation
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Release History
1.5.0 (2025-04-04)
Features Added
- New RedTeam agent functionality to assess the safety and resilience of AI systems against adversarial prompt attacks
1.4.0 (2025-03-27)
Features Added
- Enhanced binary evaluation results with customizable thresholds (see the sketch after this list)
- Added threshold support for QA and ContentSafety evaluators
- Evaluation results now include both the score and threshold values
- Configurable threshold parameter allows custom binary classification boundaries
- Default thresholds provided for backward compatibility
- Quality evaluators use "higher is better" scoring (score ≥ threshold is positive)
- Content safety evaluators use "lower is better" scoring (score ≤ threshold is positive)
- New built-in evaluator CodeVulnerabilityEvaluator has been added. It provides capabilities to identify the following code vulnerabilities:
- path-injection
- sql-injection
- code-injection
- stack-trace-exposure
- incomplete-url-substring-sanitization
- flask-debug
- clear-text-logging-sensitive-data
- incomplete-hostname-regexp
- server-side-unvalidated-url-redirection
- weak-cryptographic-algorithm
- full-ssrf
- bind-socket-all-network-interfaces
- client-side-unvalidated-url-redirection
- likely-bugs
- reflected-xss
- clear-text-storage-sensitive-data
- tarslip
- hardcoded-credentials
- insecure-randomness
- It also supports multiple coding languages, such as Python, Java, C++, C#, Go, JavaScript, and SQL.
- New built-in evaluator UngroundedAttributesEvaluator has been added. It evaluates ungrounded inference of human attributes for a given query, response, and context in single-turn evaluation only, where query represents the user query and response represents the AI system response given the provided context. Ungrounded Attributes checks whether a response is ungrounded and whether it contains information about the protected class or emotional state of a person. It identifies the following attributes:
- emotional_state
- protected_class
- groundedness
- New built-in evaluators for Agent Evaluation (Preview); a usage sketch follows this list:
- IntentResolutionEvaluator - Evaluates the intent resolution of an agent's response to a user query.
- ResponseCompletenessEvaluator - Evaluates the response completeness of an agent's response to a user query.
- TaskAdherenceEvaluator - Evaluates the task adherence of an agent's response to a user query.
- ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.
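To make the additions above concrete, here is a minimal sketch combining a thresholded content safety evaluator with one of the preview agent evaluators. The threshold keyword, the IntentResolutionEvaluator import path, and its query/response call shape are assumptions based on the notes above rather than an authoritative API reference; the project and model configuration values are placeholders.

import os

from azure.ai.evaluation import ViolenceEvaluator, IntentResolutionEvaluator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

# Content safety evaluator with a custom binary classification boundary.
# Safety evaluators use "lower is better" scoring: score <= threshold is positive.
# The `threshold` keyword name is an assumption based on the release notes.
violence_evaluator = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
    threshold=3,
)
result = violence_evaluator(
    query="Describe your favorite hiking trail.",
    response="The Skyline trail has great views and an easy grade.",
)

# Preview agent evaluator; the call shape below is assumed for illustration.
intent_resolution_evaluator = IntentResolutionEvaluator(model_config)
result = intent_resolution_evaluator(
    query="Book me a flight to Tokyo next Friday.",
    response="I found three options and booked the 9:00 AM departure for next Friday.",
)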
Bugs Fixed
- Fixed error in GroundednessProEvaluator when handling non-numeric values like "n/a" returned from the service.
- Uploading local evaluation results from evaluate with the same run name will no longer result in each online run sharing (and bashing) result files.
1.3.0 (2025-02-28)
Breaking Changes
- Multimodal-specific evaluators ContentSafetyMultimodalEvaluator, ViolenceMultimodalEvaluator, SexualMultimodalEvaluator, SelfHarmMultimodalEvaluator, HateUnfairnessMultimodalEvaluator and ProtectedMaterialMultimodalEvaluator have been removed. Please use ContentSafetyEvaluator, ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator and ProtectedMaterialEvaluator instead.
- The metric name in ProtectedMaterialEvaluator's output has changed from protected_material.fictional_characters_label to protected_material.fictional_characters_defect_rate. It is now consistent with other evaluators' metric names (ending with _defect_rate).
1.2.0 (2025-01-27)
Features Added
- CSV files are now supported as data file inputs with the evaluate() API. The CSV file should have a header row with column names that match the data and target fields in the evaluate() method, and the filename should be passed as the data parameter. The column name 'Conversation' in a CSV file is not fully supported yet.
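For illustration, a minimal sketch of evaluating a CSV file, assuming data.csv has a header row with columns named response and ground_truth that match the evaluator's inputs:

from azure.ai.evaluation import evaluate, F1ScoreEvaluator

# data.csv is assumed to contain a header row such as: query,response,ground_truth
result = evaluate(
    data="data.csv",
    evaluators={"f1_score": F1ScoreEvaluator()},
)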
Breaking Changes
- ViolenceMultimodalEvaluator, SexualMultimodalEvaluator, SelfHarmMultimodalEvaluator, HateUnfairnessMultimodalEvaluator and ProtectedMaterialMultimodalEvaluator will be removed in the next release.
Bugs Fixed
- Removed the [remote] extra. This is no longer needed when tracking results in Azure AI Studio.
- Fixed AttributeError: 'NoneType' object has no attribute 'get' while running the simulator with 1000+ results.
- Fixed the non-adversarial simulator to run in task-free mode.
- Content safety evaluators (violence, self harm, sexual, hate/unfairness) return the maximum result as the main score when aggregating per-turn evaluations from a conversation into an overall evaluation score. Other conversation-capable evaluators still default to a mean for aggregation.
- Fixed a bug in the non-adversarial simulator sample where tasks was undefined.
Other Changes
- Changed the minimum required Python version to use this package from 3.8 to 3.9.
- Stopped depending on the local promptflow service. No promptflow service will automatically start when running evaluation.
- Evaluators internally allow for custom aggregation. However, this causes serialization failures if evaluated while the environment variable AI_EVALS_BATCH_USE_ASYNC is set to false.
1.1.0 (2024-12-12)
Features Added
- Added image support in ContentSafetyEvaluator, ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator and ProtectedMaterialEvaluator. Provide image URLs or base64-encoded images in the conversation input for image evaluation. See below for an example:
evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)

conversation = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}

print("Calling Content Safety Evaluator for multi-modal")
score = evaluator(conversation=conversation)
- Please switch to generic evaluators for image evaluations as mentioned above. ContentSafetyMultimodalEvaluator, ContentSafetyMultimodalEvaluatorBase, ViolenceMultimodalEvaluator, SexualMultimodalEvaluator, SelfHarmMultimodalEvaluator, HateUnfairnessMultimodalEvaluator and ProtectedMaterialMultimodalEvaluator will be deprecated in the next release.
Bugs Fixed
- Removed the [remote] extra. This is no longer needed when tracking results in the Azure AI Foundry portal.
- Fixed AttributeError: 'NoneType' object has no attribute 'get' while running the simulator with 1000+ results.
1.0.1 (2024-11-15)
Bugs Fixed
- Removed azure-ai-inference as a dependency.
- Fixed AttributeError: 'NoneType' object has no attribute 'get' while running the simulator with 1000+ results.
1.0.0 (2024-11-13)
Breaking Changes
- The parallel parameter has been removed from composite evaluators: QAEvaluator, ContentSafetyChatEvaluator, and ContentSafetyMultimodalEvaluator. To control evaluator parallelism, you can now use the _parallel keyword argument, though please note that this private parameter may change in the future.
- Parameters query_response_generating_prompty_kwargs and user_simulator_prompty_kwargs have been renamed to query_response_generating_prompty_options and user_simulator_prompty_options in the Simulator's call method.
Bugs Fixed
- Fixed an issue where the output_path parameter in the evaluate API did not support relative paths.
- Outputs of adversarial simulators are of type JsonLineList, and the helper function to_eval_qr_json_lines now outputs context from both user and assistant turns along with category if it exists in the conversation.
- Fixed an issue where, during long-running simulations, the API token expires causing a "Forbidden" error. Instead, users can now set the environment variable AZURE_TOKEN_REFRESH_INTERVAL to refresh the token more frequently to prevent expiration and ensure continuous operation of the simulation (see the sketch after this list).
- Fixed an issue with the ContentSafetyEvaluator that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the _parallel boolean keyword argument during class initialization.
- Fixed the evaluate function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or otherwise difficult to process. Such values are ignored fully, so the aggregated metric of [1, 2, 3, NaN] would be 2, not 1.5.
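As referenced in the list above, a minimal sketch of setting the token refresh interval before starting a long-running simulation; the value and unit (seconds) are assumptions for illustration:

import os

# Refresh the auth token periodically so long-running simulations do not hit
# "Forbidden" errors once the original token expires.
os.environ["AZURE_TOKEN_REFRESH_INTERVAL"] = "600"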
Other Changes
- Refined error messages for service-based evaluators and simulators.
- Tracing has been disabled due to a Cosmos DB initialization issue.
- Introduced the environment variable AI_EVALS_DISABLE_EXPERIMENTAL_WARNING to disable the warning message for experimental features.
- Changed the randomization pattern for AdversarialSimulator such that there is an almost equal number of adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the AdversarialSimulator outputs. Previously, for 200 max_simulation_results a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, the user will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
- For the DirectAttackSimulator, the prompt templates used to generate simulated outputs for each adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass randomize_order=True when you call the DirectAttackSimulator, for example:
adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    adversarial_simulator(
        scenario=scenario,
        target=callback,
        randomize_order=True
    )
)
1.0.0b5 (2024-10-28)
Features Added
- Added GroundednessProEvaluator, which is a service-based evaluator for determining response groundedness.
- Groundedness detection in the non-adversarial simulator via query/context pairs:
import asyncio
import importlib.resources as pkg_resources
import json

from azure.ai.evaluation.simulator import Simulator

package = "azure.ai.evaluation.simulator._data_sources"
resource_name = "grounding.json"

custom_simulator = Simulator(model_config=model_config)
conversation_turns = []

with pkg_resources.path(package, resource_name) as grounding_file:
    with open(grounding_file, "r") as file:
        data = json.load(file)

for item in data:
    conversation_turns.append([item])

outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=conversation_turns,
    max_conversation_turns=1,
))
- Added an evaluator for multimodal use cases
Breaking Changes
- Renamed environment variable PF_EVALS_BATCH_USE_ASYNC to AI_EVALS_BATCH_USE_ASYNC.
- RetrievalEvaluator now requires a context input in addition to query in single-turn evaluation.
- RelevanceEvaluator no longer takes context as an input. It now only takes query and response in single-turn evaluation.
- FluencyEvaluator no longer takes query as an input. It now only takes response in single-turn evaluation.
- The AdversarialScenario enum does not include ADVERSARIAL_INDIRECT_JAILBREAK; invoking IndirectJailbreak or XPIA should be done with IndirectAttackSimulator.
- Outputs of Simulator and AdversarialSimulator previously had to_eval_qa_json_lines and now have to_eval_qr_json_lines. Where to_eval_qa_json_lines had:
{"question": <user_message>, "answer": <assistant_message>}
to_eval_qr_json_lines now has:
{"query": <user_message>, "response": <assistant_message>}
Bugs Fixed
- The non-adversarial simulator works with gpt-4o models using the json_schema response format.
- Fixed an issue where the evaluate API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when the venv folder and the target function file are in the same directory.
- Fixed an evaluate API failure when trace.destination is set to none.
- The non-adversarial simulator now accepts context from the callback.
Other Changes
- Improved error messages for the evaluate API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
- GroundednessEvaluator now supports query as an optional input in single-turn evaluation. If query is provided, a different prompt template will be used for the evaluation (see the sketch after this list).
- To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the gpt_ prefix. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
  - CoherenceEvaluator
  - RelevanceEvaluator
  - FluencyEvaluator
  - GroundednessEvaluator
  - SimilarityEvaluator
  - RetrievalEvaluator
- The following evaluators will now have a new key in their result output including the LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.

Evaluator | New max_token for generation
---|---
CoherenceEvaluator | 800
RelevanceEvaluator | 800
FluencyEvaluator | 800
GroundednessEvaluator | 800
RetrievalEvaluator | 1600

- Improved the error message for storage access permission issues to provide clearer guidance for users.
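As referenced in the list above, a minimal sketch of passing the optional query to GroundednessEvaluator; model_config is an Azure OpenAI model configuration as shown earlier in this document, and the inputs are illustrative:

from azure.ai.evaluation import GroundednessEvaluator

groundedness_evaluator = GroundednessEvaluator(model_config)

# When `query` is provided, a different prompt template is used for the evaluation.
result = groundedness_evaluator(
    query="Which tent is the most waterproof?",
    context="The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000mm.",
    response="The Alpine Explorer Tent is the most waterproof.",
)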
1.0.0b4 (2024-10-16)
Breaking Changes
- Removed the numpy dependency. All NaN values returned by the SDK have been changed from numpy.nan to math.nan.
- credential is now required to be passed in for all content safety evaluators and ProtectedMaterialsEvaluator. DefaultAzureCredential will no longer be chosen if a credential is not passed.
- Changed the package extra name from "pf-azure" to "remote".
Bugs Fixed
- Adversarial conversation simulations would fail with Forbidden. Added logic to re-fetch the token in the exponential retry logic to retrieve the RAI Service response.
- Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.
Other Changes
- Enhanced the error message to provide clearer instruction when required packages for the remote tracking feature are missing.
- The per-evaluator run summary is now printed at the end of the Evaluate API call to make troubleshooting row-level failures easier.
1.0.0b3 (2024-10-01)
Features Added
- Added a type field to AzureOpenAIModelConfiguration and OpenAIModelConfiguration.
- The following evaluators now support conversation as an alternative input to their usual single-turn inputs (see the sketch after this list):
  - ViolenceEvaluator
  - SexualEvaluator
  - SelfHarmEvaluator
  - HateUnfairnessEvaluator
  - ProtectedMaterialEvaluator
  - IndirectAttackEvaluator
  - CoherenceEvaluator
  - RelevanceEvaluator
  - FluencyEvaluator
  - GroundednessEvaluator
- Surfaced RetrievalScoreEvaluator, formerly an internal part of ChatEvaluator, as a standalone conversation-only evaluator.
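As referenced in the list above, a minimal sketch of the conversation input; the message contents are illustrative, and model_config is an Azure OpenAI model configuration as shown elsewhere in this document:

from azure.ai.evaluation import CoherenceEvaluator

conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris is the capital of France."},
    ]
}

coherence_evaluator = CoherenceEvaluator(model_config)
# Per-turn results are aggregated into an overall score for the conversation.
result = coherence_evaluator(conversation=conversation)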
Breaking Changes
- Removed ContentSafetyChatEvaluator and ChatEvaluator.
- The evaluator_config parameter of evaluate now maps an evaluator name to a dictionary EvaluatorConfig, which is a TypedDict. The column_mapping between data or target and evaluator field names should now be specified inside this new dictionary:
Before:
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    },
    ...
)
After:
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
            }
        }
    },
    ...
)
- The Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables usage of the simulator with Entra ID based auth. Before:
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RESOURCE_GROUP"),
    "project_name": os.environ.get("PROJECT_NAME"),
}
sim = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
After:
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
}
sim = Simulator(model_config=model_config)
If api_key is not included in the model_config, the prompty runtime in promptflow-core will pick up DefaultAzureCredential.
Bugs Fixed
- Fixed an issue where Entra ID authentication was not working with AzureOpenAIModelConfiguration.
1.0.0b2 (2024-09-24)
Breaking Changes
- data and evaluators are now required keywords in evaluate.
1.0.0b1 (2024-09-20)
Breaking Changes
- The synthetic namespace has been renamed to simulator, and sub-namespaces under this module have been removed.
- The evaluate and evaluators namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace azure.ai.evaluation.
- The parameter name project_scope in content safety evaluators has been renamed to azure_ai_project for consistency with the evaluate API and simulators.
- Model configuration classes are now of type TypedDict and are exposed in the azure.ai.evaluation module instead of coming from promptflow.core.
- Updated the parameter names for question and answer in built-in evaluators to more generic terms: query and response.
Features Added
- First preview.
- This package is a port of promptflow-evals. New features will be added only to this package moving forward.
- Added a TypedDict for AzureAIProject that allows for better IntelliSense and type checking when passing in project information (see the sketch below).
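For illustration, a minimal sketch of the AzureAIProject TypedDict, assuming it is importable from the root azure.ai.evaluation namespace as described above; the values are placeholders:

from azure.ai.evaluation import AzureAIProject

# Type checkers and IDEs can validate the dictionary keys against the TypedDict.
azure_ai_project: AzureAIProject = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}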
File details
Details for the file azure_ai_evaluation-1.5.0.tar.gz.
File metadata
- Download URL: azure_ai_evaluation-1.5.0.tar.gz
- Upload date:
- Size: 817.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: RestSharp/106.13.0.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 694e3bd635979348790c96eb43b390b89eb91ebd17e822229a32c9d2fdb77e6f
MD5 | d3cfba7147f379839fd1b7e35331a5ff
BLAKE2b-256 | 5a721a494053b221d0b607bfc84d540d9d1b6e002b17757f9372a61d054b18b5
File details
Details for the file azure_ai_evaluation-1.5.0-py3-none-any.whl.
File metadata
- Download URL: azure_ai_evaluation-1.5.0-py3-none-any.whl
- Upload date:
- Size: 773.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: RestSharp/106.13.0.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2845898ef83f7097f201d8def4d8158221529f88102348a72b7962fc9605007a
MD5 | d3a9f7979d6615346e77c44c175e9e2c
BLAKE2b-256 | adcf59e8591f29fcf702e8340816fc16db1764fc420553f60e552ec590aa189e
BLAKE2b-256 | adcf59e8591f29fcf702e8340816fc16db1764fc420553f60e552ec590aa189e |