Evals framework for evaluating LLMs with structured outputs (i.e., JSON mode)

These details have not been verified by PyPI

Project description

Structured Evals

structured-evals

A framework for evaluating LLMs with structured outputs (i.e., JSON-mode) using pydantic models.

🔍 Overview

structured-evals is a Python library for evaluating the quality of structured outputs from language models. It provides a flexible framework for registering and applying metrics to pydantic models (compatible with Pydantic v2), making it easy to evaluate the accuracy of model predictions against ground truth data.

✨ Features

Structured Evaluation: Evaluate nested pydantic models with type-specific metrics
Flexible Metric Registration: Register custom metrics for specific field types
Pandas Integration: Convert evaluation results to pandas DataFrames for easy analysis
Optional Integrations: Seamlessly integrate with other evaluation frameworks like autoevals and OpenAI's evals
Comprehensive Type Support: Support for primitive types, lists, dictionaries, nested models, and more
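
To make the idea of type-specific metrics on nested models concrete, here is a minimal sketch of the general technique. It is not the library's implementation: it uses standard-library dataclasses instead of pydantic models, and `Registry` and `evaluate` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Callable

# Hypothetical stand-in for a metric registry: maps a field type to a list of
# (metric name, callback) pairs. Not the structured-evals API.
Registry = dict[type, list[tuple[str, Callable[[object, object], float]]]]


def evaluate(gt: object, pred: object, registry: Registry, prefix: str = "") -> dict:
    """Walk two dataclass instances in parallel and score each leaf field."""
    results: dict[str, dict[str, float]] = {}
    base = prefix or type(gt).__name__
    for f in fields(gt):
        a, b = getattr(gt, f.name), getattr(pred, f.name)
        path = f"{base}.{f.name}"
        if is_dataclass(a):
            # Nested model: recurse and extend the dotted path.
            results.update(evaluate(a, b, registry, prefix=path))
            continue
        # Exact type lookup, so a bool is not scored with int metrics
        # (bool is a subclass of int in Python).
        scores = {name: cb(a, b) for name, cb in registry.get(type(a), [])}
        if scores:
            results[path] = scores
    return results


@dataclass
class Address:
    city: str
    zip_code: str


@dataclass
class Person:
    name: str
    age: int
    address: Address


exact = ("exact_match", lambda a, b: float(a == b))
registry: Registry = {
    str: [exact],
    int: [exact, ("abs_diff", lambda a, b: float(abs(a - b)))],
}

gt = Person("John Doe", 30, Address("Paris", "75001"))
pred = Person("John Doe", 32, Address("Paris", "75002"))
scores = evaluate(gt, pred, registry)
```

Each leaf field ends up under a dotted path such as `Person.address.city`, with one score per metric registered for that field's type.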

📦 Installation

# Basic installation (requires Pydantic v2)
pip install structured-evals

# With optional dependencies
pip install "structured-evals[autoevals]"  # For autoevals integration
pip install "structured-evals[openai-evals]"  # For OpenAI evals integration
pip install "structured-evals[all]"  # For all optional dependencies

🚀 Quickstart

from pydantic import BaseModel
from structured_evals.metrics import Evaluator, Metric

# Define your pydantic model
class Person(BaseModel):
    name: str
    age: int
    is_active: bool

# Create ground truth and prediction instances
x_gt = Person(name="John Doe", age=30, is_active=True)
x = Person(name="John Doe", age=32, is_active=True)

# Create evaluator and register metrics
evaluator = Evaluator()
evaluator.register(Person, (int, float, bool, str), Metric(name="exact_match", cb=lambda a, b: float(a == b)))
evaluator.register(Person, (int, float), Metric(name="abs_diff", cb=lambda a, b: abs(a - b)))

# Evaluate prediction against ground truth
results = evaluator(x_gt, x)

# View results as a dictionary
print(results)
# Example output: {'Person.name': {'exact_match': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
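
The nested result dictionary is convenient to print but awkward to aggregate. The sketch below flattens a dictionary of that shape into one row per (field, metric) pair and averages each metric, using only the standard library. `to_rows` and `mean_by_metric` are hypothetical helper names, not part of structured-evals, and the `results` literal simply copies the example output above.

```python
# Same shape as the example output above (copied here so the sketch is self-contained).
results = {
    "Person.name": {"exact_match": 1.0},
    "Person.age": {"exact_match": 0.0, "abs_diff": 2.0},
    "Person.is_active": {"exact_match": 1.0},
}


def to_rows(results: dict) -> list[dict]:
    """Flatten {field: {metric: score}} into long-format rows."""
    return [
        {"field": field, "metric": metric, "score": score}
        for field, metrics in results.items()
        for metric, score in metrics.items()
    ]


def mean_by_metric(rows: list[dict]) -> dict[str, float]:
    """Average the scores of each metric across all fields."""
    grouped: dict[str, list[float]] = {}
    for row in rows:
        grouped.setdefault(row["metric"], []).append(row["score"])
    return {metric: sum(vals) / len(vals) for metric, vals in grouped.items()}


rows = to_rows(results)
summary = mean_by_metric(rows)
```

If pandas is installed, `pandas.DataFrame(rows)` turns the same rows into a long-format table. The library's own DataFrame conversion is not shown in this README, so this is only one way to get there.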

🔧 Custom Metrics

You can register custom metrics for specific field types:

from structured_evals.metrics import Evaluator, Metric

# Define a custom metric using Jaccard similarity
def jaccard_similarity(text1: str, text2: str) -> float:
    """
    Calculate Jaccard similarity between two strings.
    
    Jaccard similarity is the size of the intersection divided by the size of the union of two sets.
    """
    if not text1 or not text2:
        return 0.0
    
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    
    return intersection / union if union > 0 else 0.0

# Register the custom metric (reuses `evaluator`, `Person`, `x_gt`, and `x` from the Quickstart)
evaluator.register(
    Person,  # The model class
    str,     # The field type
    Metric(name="jaccard_similarity", cb=jaccard_similarity)
)

# Evaluate with the custom metric
results = evaluator(x_gt, x)

# View results as a dictionary
print(results)
# Example output: {'Person.name': {'jaccard_similarity': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
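
Because the Quickstart names are identical, the score above is trivially 1.0. The standalone sketch below repeats the same word-level Jaccard function on inputs that only partly overlap, to show how the score behaves; the sample strings are made up for illustration.

```python
def jaccard_similarity(text1: str, text2: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union| of lowercase tokens."""
    if not text1 or not text2:
        return 0.0
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


# {john, doe} vs {john, a., doe}: 2 shared tokens out of 3 distinct tokens.
partial = jaccard_similarity("John Doe", "John A. Doe")

# {john, doe} vs {jane, doe}: 1 shared token out of 3 distinct tokens.
weak = jaccard_similarity("John Doe", "Jane Doe")

# Case differences are ignored because both strings are lowercased first.
same = jaccard_similarity("JOHN DOE", "john doe")

# An empty string on either side short-circuits to 0.0.
empty = jaccard_similarity("", "John Doe")
```

Since tokens are split on whitespace only, punctuation stays attached to words (`"a."` is its own token), which may or may not be what a given evaluation needs.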

🔄 Integration with Other Frameworks

You can easily integrate with other evaluation frameworks by creating custom metrics that wrap their functionality:

from structured_evals.metrics import Evaluator, Metric

# Example integration with autoevals
# Note: This requires installing the optional dependencies with `pip install structured-evals[autoevals]`
from autoevals.llm import Factuality

# Create the autoevals metric
factuality = Factuality()

def factuality_metric(text1: str, text2: str) -> float:
    """Evaluate factuality using autoevals."""
    if not isinstance(text1, str) or not isinstance(text2, str):
        return 0.0
    result = factuality(text2, text1, input="Determine if the following two answers are consistent.")
    return result.score

# Register the metric (reuses `evaluator` and `Person` from the Quickstart)
evaluator.register(
    Person,
    str,
    Metric(name="factuality", cb=factuality_metric)
)

# Evaluate with the integrated metrics
results = evaluator(x_gt, x)

# View results as a dictionary
print(results)
# Example output: {'Person.name': {'factuality': 0.95, 'exact_match': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
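
Scorers from external frameworks often call a remote model, so they may raise or return no score at all. One way to keep a metric callback well-behaved is to wrap the scorer, as sketched below. `safe_metric` is a hypothetical helper, and the three stub scorers stand in for a real external scorer so the example runs offline; how a particular framework actually signals failure is an assumption here.

```python
from typing import Callable, Optional


def safe_metric(
    scorer: Callable[[str, str], Optional[float]],
    default: float = 0.0,
) -> Callable[[object, object], float]:
    """Wrap an external scorer so the callback always returns a float."""

    def wrapped(a: object, b: object) -> float:
        # Only score string pairs; anything else falls back to the default.
        if not isinstance(a, str) or not isinstance(b, str):
            return default
        try:
            score = scorer(a, b)
        except Exception:
            # Network errors, rate limits, etc. (assumed failure modes).
            return default
        return default if score is None else float(score)

    return wrapped


# Stub scorers standing in for a real external framework.
def good_scorer(expected: str, output: str) -> Optional[float]:
    return 0.95


def empty_scorer(expected: str, output: str) -> Optional[float]:
    return None


def flaky_scorer(expected: str, output: str) -> Optional[float]:
    raise RuntimeError("simulated upstream failure")


ok_score = safe_metric(good_scorer)("John Doe", "John Doe")
none_score = safe_metric(empty_scorer)("John Doe", "John Doe")
error_score = safe_metric(flaky_scorer)("John Doe", "John Doe")
non_str_score = safe_metric(good_scorer)(30, 32)
```

The wrapped callable has the same two-argument shape as the `cb` callbacks used above, so it could be passed as `Metric(name="factuality", cb=safe_metric(...))`. Returning a default on failure hides errors, so logging inside the `except` branch is worth considering.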

📝 Writing Custom Evals

See docs/custom-eval.md for detailed instructions on writing custom evaluations.

📄 License

MIT

Project details

Release history

This version

0.0.0.dev0 pre-release

Mar 18, 2025

Download files

Source Distribution

structured_evals-0.0.0.dev0.tar.gz (15.1 kB)

Uploaded Mar 18, 2025 Source

Built Distribution

structured_evals-0.0.0.dev0-py3-none-any.whl (12.0 kB)

Uploaded Mar 18, 2025 Python 3

File details

Details for the file structured_evals-0.0.0.dev0.tar.gz.

File metadata

Download URL: structured_evals-0.0.0.dev0.tar.gz
Upload date: Mar 18, 2025
Size: 15.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for structured_evals-0.0.0.dev0.tar.gz
Algorithm	Hash digest
SHA256	`2d86e1f226e1fa56ba65082d02528592e7058f3f6f6d5583f18da2226e557846`
MD5	`6f646e7135dcf65074085a4cc42cf6f7`
BLAKE2b-256	`dbf9608f32930144e8293f38cd52ea5bf80b1cc44c38275b582b1b23f7b49b3e`

File details

Details for the file structured_evals-0.0.0.dev0-py3-none-any.whl.

File metadata

Download URL: structured_evals-0.0.0.dev0-py3-none-any.whl
Upload date: Mar 18, 2025
Size: 12.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for structured_evals-0.0.0.dev0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`529966574090414fff059d451f3c2c72446d80a361182edf2a0c4ba5fd0cb404`
MD5	`fb01809d41dbe7b81d690e7f6a4b7b28`
BLAKE2b-256	`e8d5b9979e6e131f7b04b306260801c259ac6eededbd6bfa9d0d1d33e7290041`
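
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's `hashlib`; `sha256_of` and `verify_sha256` are hypothetical helper names, and the temporary file stands in for a real download so the example runs offline.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    matches = verify_sha256(sample, expected)
    mismatch = verify_sha256(sample, "0" * 64)
```

For the wheel listed above, the call would be `verify_sha256(Path("structured_evals-0.0.0.dev0-py3-none-any.whl"), "529966574090414fff059d451f3c2c72446d80a361182edf2a0c4ba5fd0cb404")`, assuming the file sits in the current directory.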
