Pydantic Evals
Framework for evaluating stochastic code execution, especially code making use of LLMs
Project description
This is a library for evaluating non-deterministic (or "stochastic") functions in Python. It provides a simple, Pythonic interface for defining and running stochastic functions, and for analyzing the results.
While this library is developed as part of Pydantic AI, it only uses Pydantic AI for a small subset of generative functionality internally, and it is designed to be used with arbitrary "stochastic function" implementations. In particular, it can be used with other (non-Pydantic AI) AI libraries, agent frameworks, etc.
As with Pydantic AI, this library prioritizes type safety and use of common Python syntax over esoteric, domain-specific use of Python syntax.
Full documentation is available at ai.pydantic.dev/evals.
Example
While you'd typically use Pydantic Evals with more complex functions (such as Pydantic AI agents or graphs), here's a quick example that evaluates a simple function against a test case using both custom and built-in evaluators:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance

# Define a test case with inputs and expected output
case = Case(
    name='capital_question',
    inputs='What is the capital of France?',
    expected_output='Paris',
)

# Define a custom evaluator
class MatchAnswer(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0
        elif isinstance(ctx.output, str) and ctx.expected_output.lower() in ctx.output.lower():
            return 0.8
        return 0.0

# Create a dataset with the test case and evaluators
dataset = Dataset(
    name='capital_eval',
    cases=[case],
    evaluators=[IsInstance(type_name='str'), MatchAnswer()],
)

# Define the function to evaluate
async def answer_question(question: str) -> str:
    return 'Paris'

# Run the evaluation
report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
"""
Evaluation Summary: answer_question
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID ┃ Inputs ┃ Outputs ┃ Scores ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ capital_question │ What is the capital of France? │ Paris │ MatchAnswer: 1.00 │ ✔ │ 10ms │
├──────────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┼──────────┤
│ Averages │ │ │ MatchAnswer: 1.00 │ 100.0% ✔ │ 10ms │
└──────────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┴──────────┘
"""
Using the library with more complex functions, such as Pydantic AI agents, is similar — all you need to do is define a task function wrapping the function you want to evaluate, with a signature that matches the inputs and outputs of your test cases.
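This wrapping pattern can be sketched in plain Python. The `call_agent` function below is a hypothetical stand-in for a more complex system (for example, an agent call that takes extra configuration alongside the question); the task function fixes the extra parameters so its signature matches the dataset's cases, and could then be passed to `dataset.evaluate_sync`:

```python
import asyncio

# Hypothetical stand-in for a more complex system, e.g. an agent call
# that needs extra configuration alongside the question itself.
async def call_agent(question: str, *, model: str) -> str:
    return 'Paris' if 'France' in question else 'unknown'

# The task function closes over the extra parameters so its signature
# matches the test cases: a str input in, a str output out.
async def answer_question(question: str) -> str:
    return await call_agent(question, model='some-model')

print(asyncio.run(answer_question('What is the capital of France?')))
```

The names `call_agent` and `'some-model'` are illustrative only; the point is that whatever your underlying system looks like, the task function's signature is what must line up with your cases' inputs and outputs.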
Logfire Integration
Pydantic Evals uses OpenTelemetry to record traces for each case in your evaluations.
You can send these traces to any OpenTelemetry-compatible backend. For the best experience, we recommend Pydantic Logfire, which includes custom views for evals:
You'll see full details about the inputs, outputs, token usage, execution durations, and more, and you'll have access to the full trace for each case: ideal for debugging, writing path-aware evaluators, or running similar evaluations against production traces.
Basic setup:
import logfire

logfire.configure(
    send_to_logfire='if-token-present',
    environment='development',
    service_name='evals',
)

...

my_dataset.evaluate_sync(my_task)
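If you are sending traces to a different OpenTelemetry-compatible backend instead of Logfire, the OpenTelemetry SDK's standard environment variables are one way to point an OTLP exporter at it. A minimal sketch, with a placeholder endpoint rather than a real deployment:

```python
import os

# Standard OpenTelemetry SDK environment variables, read by OTLP
# exporters at startup. The endpoint below is a placeholder.
os.environ['OTEL_EXPORTER_OTLP_ENDPOINT'] = 'http://localhost:4318'
os.environ['OTEL_SERVICE_NAME'] = 'evals'

print(os.environ['OTEL_SERVICE_NAME'])
```

These variables are defined by the OpenTelemetry specification, not by Pydantic Evals; how your chosen exporter picks them up depends on your OpenTelemetry setup.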
File details
Details for the file pydantic_evals-1.77.0.tar.gz.
File metadata
- Download URL: pydantic_evals-1.77.0.tar.gz
- Upload date:
- Size: 65.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 64a12324c9a3f4fefa34b5a5eb4c2320c0976e425404372fa69f2872f585c3d0 |
| MD5 | 0da16287bc7f6e30d0be70955b669e59 |
| BLAKE2b-256 | de238da617d3803362325790e73e6219dec359559c2ef2350b35dbeb9e39c92f |
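To check that a downloaded archive matches a published digest, the file can be streamed through `hashlib`. A minimal sketch (the filename in the comment refers to the source distribution above):

```python
import hashlib

def sha256_hex(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            digest.update(chunk)
    return digest.hexdigest()

# After downloading, compare against the SHA256 value from the table above:
# sha256_hex('pydantic_evals-1.77.0.tar.gz')
```

Reading in fixed-size chunks keeps memory use constant regardless of file size; the same pattern works for the MD5 and BLAKE2b digests via `hashlib.md5` and `hashlib.blake2b`.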
Provenance
The following attestation bundles were made for pydantic_evals-1.77.0.tar.gz:
Publisher: ci.yml on pydantic/pydantic-ai
- Permalink: pydantic/pydantic-ai@a282bcade2c2a75847806bbba9cf511dd0623019
- Branch / Tag: refs/tags/v1.77.0
- Owner: https://github.com/pydantic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@a282bcade2c2a75847806bbba9cf511dd0623019
- Trigger Event: push
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pydantic_evals-1.77.0.tar.gz
- Subject digest: 64a12324c9a3f4fefa34b5a5eb4c2320c0976e425404372fa69f2872f585c3d0
- Sigstore transparency entry: 1221730113
- Sigstore integration time:
File details
Details for the file pydantic_evals-1.77.0-py3-none-any.whl.
File metadata
- Download URL: pydantic_evals-1.77.0-py3-none-any.whl
- Upload date:
- Size: 77.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b536081e36d70826da216a3a6df8d84e3ac1982fc3b8078c123fd539ce6bfb6 |
| MD5 | 4934f183893ab80d9ea793ea5a20410e |
| BLAKE2b-256 | e607b9e6ba4afcce41f70a0f149cd7bc299a0babaf9c62a57d7cb291679a6cc7 |
Provenance
The following attestation bundles were made for pydantic_evals-1.77.0-py3-none-any.whl:
Publisher: ci.yml on pydantic/pydantic-ai
- Permalink: pydantic/pydantic-ai@a282bcade2c2a75847806bbba9cf511dd0623019
- Branch / Tag: refs/tags/v1.77.0
- Owner: https://github.com/pydantic
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@a282bcade2c2a75847806bbba9cf511dd0623019
- Trigger Event: push
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pydantic_evals-1.77.0-py3-none-any.whl
- Subject digest: 2b536081e36d70826da216a3a6df8d84e3ac1982fc3b8078c123fd539ce6bfb6
- Sigstore transparency entry: 1221730254
- Sigstore integration time: