Evals framework for evaluating LLMs with structured outputs (i.e. JSON-mode)
structured-evals
A framework for evaluating LLMs with structured outputs (i.e., JSON-mode) using pydantic models.
🔍 Overview
structured-evals is a Python library for evaluating the quality of structured outputs from language models. It provides a flexible framework for registering and applying metrics to pydantic models (compatible with Pydantic v2), making it easy to evaluate the accuracy of model predictions against ground truth data.
✨ Features
- Structured Evaluation: Evaluate nested pydantic models with type-specific metrics
- Flexible Metric Registration: Register custom metrics for specific field types
- Pandas Integration: Convert evaluation results to pandas DataFrames for easy analysis
- Optional Integrations: Seamlessly integrate with other evaluation frameworks like `autoevals` and OpenAI's `evals`
- Comprehensive Type Support: Support for primitive types, lists, dictionaries, nested models, and more
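To make the idea of type-specific metrics on nested models concrete, here is a minimal conceptual sketch. It is not the library's implementation: standard-library dataclasses stand in for pydantic models, and `Customer`, `Address`, `evaluate`, and `registry` are hypothetical names chosen for illustration. The sketch matches on the exact field type, which is an assumption; it is consistent with the Quickstart output, where the `bool` field receives no `abs_diff` score.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Callable

# Hypothetical stand-ins: dataclasses play the role of pydantic models here.
@dataclass
class Address:
    city: str
    zip_code: str

@dataclass
class Customer:
    name: str
    age: int
    address: Address

MetricFn = Callable[[Any, Any], float]
Registry = list[tuple[tuple[type, ...], str, MetricFn]]

def evaluate(gt: Any, pred: Any, registry: Registry, prefix: str) -> dict[str, dict[str, float]]:
    """Walk two parallel objects and score each leaf field by its exact type."""
    results: dict[str, dict[str, float]] = {}
    for f in fields(gt):
        a, b = getattr(gt, f.name), getattr(pred, f.name)
        path = f"{prefix}.{f.name}"
        if is_dataclass(a):
            # Nested model: recurse and extend the dotted path.
            results.update(evaluate(a, b, registry, path))
            continue
        # Exact type match, so a bool is not treated as an int.
        scores = {name: fn(a, b) for types, name, fn in registry if type(a) in types}
        if scores:
            results[path] = scores
    return results

registry: Registry = [
    ((int, float, bool, str), "exact_match", lambda a, b: float(a == b)),
    ((int, float), "abs_diff", lambda a, b: float(abs(a - b))),
]

gt = Customer("Ada", 36, Address("London", "N1"))
pred = Customer("Ada", 35, Address("Leeds", "N1"))
nested_results = evaluate(gt, pred, registry, "Customer")
print(nested_results)
```

In this sketch the nested fields appear under dotted paths such as `Customer.address.city`, mirroring the `Person.name` style keys shown in the Quickstart output.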
📦 Installation
# Basic installation (requires Pydantic v2)
pip install structured-evals
# With optional dependencies
pip install "structured-evals[autoevals]" # For autoevals integration
pip install "structured-evals[openai-evals]" # For OpenAI evals integration
pip install "structured-evals[all]" # For all optional dependencies (quotes keep shells such as zsh from expanding the brackets)
🚀 Quickstart
from pydantic import BaseModel
from structured_evals.metrics import Evaluator, Metric
# Define your pydantic model
class Person(BaseModel):
    name: str
    age: int
    is_active: bool
# Create ground truth and prediction instances
x_gt = Person(name="John Doe", age=30, is_active=True)
x = Person(name="John Doe", age=32, is_active=True)
# Create evaluator and register metrics
evaluator = Evaluator()
evaluator.register(Person, (int, float, bool, str), Metric(name="exact_match", cb=lambda a, b: float(a == b)))
evaluator.register(Person, (int, float), Metric(name="abs_diff", cb=lambda a, b: abs(a - b)))
# Evaluate prediction against ground truth
results = evaluator(x_gt, x)
# View results as a dictionary
print(results)
# Example output: {'Person.name': {'exact_match': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
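The nested dictionary above is convenient to read but awkward to filter or aggregate. As a rough sketch (not the library's own pandas helper, which this page does not show), the results can be flattened into one record per field and metric using plain Python. The `to_rows` function is a hypothetical name, and the input is copied from the example output above.

```python
# Results dictionary copied from the Quickstart example output.
results = {
    "Person.name": {"exact_match": 1.0},
    "Person.age": {"exact_match": 0.0, "abs_diff": 2.0},
    "Person.is_active": {"exact_match": 1.0},
}

def to_rows(results: dict) -> list:
    """Flatten {field: {metric: score}} into one record per (field, metric)."""
    return [
        {"field": field, "metric": metric, "score": score}
        for field, metrics in results.items()
        for metric, score in metrics.items()
    ]

rows = to_rows(results)
for row in rows:
    print(row)
```

If pandas is installed, a list of records like `rows` can be passed directly to `pandas.DataFrame(rows)` to get a long-format table with `field`, `metric`, and `score` columns.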
🔧 Custom Metrics
You can register custom metrics for specific field types. The snippet below reuses `Person`, `evaluator`, `x_gt`, and `x` from the Quickstart:
from structured_evals.metrics import Evaluator, Metric
# Define a custom metric using Jaccard similarity
def jaccard_similarity(text1: str, text2: str) -> float:
    """
    Calculate Jaccard similarity between two strings.
    Jaccard similarity is the size of the intersection divided by the size of the union of two sets.
    """
    if not text1 or not text2:
        return 0.0
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union > 0 else 0.0
# Register the custom metric
evaluator.register(
    Person,  # The model class
    str,  # The field type
    Metric(name="jaccard_similarity", cb=jaccard_similarity)
)
# Evaluate with the custom metric
results = evaluator(x_gt, x)
# View results as a dictionary
print(results)
# Example output: {'Person.name': {'jaccard_similarity': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
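Because both names in the Quickstart are identical, the Jaccard score above is trivially 1.0. The standalone sketch below repeats the same token-level logic and applies it to a partially overlapping pair, a case-only difference, and an empty string, to show how the metric behaves in less trivial cases. The example strings are made up for illustration.

```python
def jaccard_similarity(text1: str, text2: str) -> float:
    """Token-level Jaccard similarity, same logic as the metric above."""
    if not text1 or not text2:
        return 0.0
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

# One shared token ("john") out of three distinct tokens ("john", "doe", "smith").
partial = jaccard_similarity("John Doe", "John Smith")
# Tokens are lowercased before comparison, so case differences do not matter.
case_insensitive = jaccard_similarity("John Doe", "john doe")
# An empty input short-circuits to 0.0.
empty = jaccard_similarity("John Doe", "")

print(partial, case_insensitive, empty)
```

Here `partial` is 1/3, `case_insensitive` is 1.0, and `empty` is 0.0. Note that the metric compares whitespace-separated tokens, so a single-character typo inside a token counts as a complete mismatch for that token.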
🔄 Integration with Other Frameworks
You can easily integrate with other evaluation frameworks by creating custom metrics that wrap their functionality:
from structured_evals.metrics import Evaluator, Metric
# Example integration with autoevals
# Note: This requires installing the optional dependencies with `pip install "structured-evals[autoevals]"`
from autoevals.llm import Factuality
# Create the autoevals metric
factuality = Factuality()
def factuality_metric(text1: str, text2: str) -> float:
    """Evaluate factuality using autoevals."""
    if not isinstance(text1, str) or not isinstance(text2, str):
        return 0.0
    result = factuality(text2, text1, input="Determine if the following two answers are consistent.")
    return result.score
# Register the metric
evaluator.register(
    Person,
    str,
    Metric(name="factuality", cb=factuality_metric)
)
# Evaluate with the integrated metrics
results = evaluator(x_gt, x)
# View results as a dictionary
print(results)
# Example output: {'Person.name': {'factuality': 0.95, 'exact_match': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
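The wrapping pattern above can be generalized into a small adapter. The sketch below is one possible approach, not part of structured-evals: `as_metric_callback`, `FakeScore`, and `fake_scorer` are hypothetical names. It assumes the wrapped scorer is called as `scorer(output, expected, **kwargs)` and returns an object with a `score` attribute that may be `None`, and that the metric callback receives the ground truth first and the prediction second, as in `evaluator(x_gt, x)`. A fake scorer is used so the example runs offline without any model calls.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

def as_metric_callback(
    scorer: Callable[..., Any], default: float = 0.0, **scorer_kwargs: Any
) -> Callable[[Any, Any], float]:
    """Adapt scorer(output, expected, **kwargs) into cb(ground_truth, prediction) -> float."""
    def cb(ground_truth: Any, prediction: Any) -> float:
        # Non-string fields cannot be scored by a text-based scorer.
        if not isinstance(ground_truth, str) or not isinstance(prediction, str):
            return default
        result = scorer(prediction, ground_truth, **scorer_kwargs)
        score = getattr(result, "score", None)
        # Fall back to the default when the scorer produces no score.
        return default if score is None else float(score)
    return cb

# Hypothetical stand-in scorer so the sketch runs without network access.
@dataclass
class FakeScore:
    score: Optional[float]

def fake_scorer(output: str, expected: str, **kwargs: Any) -> FakeScore:
    if not output:
        return FakeScore(score=None)
    return FakeScore(score=1.0 if output == expected else 0.5)

cb = as_metric_callback(fake_scorer, input="Are these answers consistent?")
identical = cb("John Doe", "John Doe")
different = cb("John Doe", "Jon Doe")
missing = cb("John Doe", "")
wrong_type = cb("John Doe", 42)
print(identical, different, missing, wrong_type)
```

A callback built this way could then be passed as the `cb` argument of `Metric`, in the same way as `factuality_metric` above. Returning a default instead of `None` keeps every metric value numeric, at the cost of hiding scorer failures, so a different default (or raising an error) may suit some evaluations better.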
📝 Writing Custom Evals
See docs/custom-eval.md for detailed instructions on writing custom evaluations.
📄 License
MIT
File details
Details for the file structured_evals-0.0.0.dev0.tar.gz.
File metadata
- Download URL: structured_evals-0.0.0.dev0.tar.gz
- Upload date:
- Size: 15.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2d86e1f226e1fa56ba65082d02528592e7058f3f6f6d5583f18da2226e557846 |
| MD5 | 6f646e7135dcf65074085a4cc42cf6f7 |
| BLAKE2b-256 | dbf9608f32930144e8293f38cd52ea5bf80b1cc44c38275b582b1b23f7b49b3e |
File details
Details for the file structured_evals-0.0.0.dev0-py3-none-any.whl.
File metadata
- Download URL: structured_evals-0.0.0.dev0-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 529966574090414fff059d451f3c2c72446d80a361182edf2a0c4ba5fd0cb404 |
| MD5 | fb01809d41dbe7b81d690e7f6a4b7b28 |
| BLAKE2b-256 | e8d5b9979e6e131f7b04b306260801c259ac6eededbd6bfa9d0d1d33e7290041 |