Evals framework for evaluating LLMs with structured outputs (i.e. JSON-mode)

Structured Evals

structured-evals

A framework for evaluating LLMs with structured outputs (i.e., JSON-mode) using pydantic models.

🔍 Overview

structured-evals is a Python library for evaluating the quality of structured outputs from language models. You register metrics for specific field types on a Pydantic v2 model, and the evaluator applies them field by field to score a prediction against ground truth data.

✨ Features

  • Structured Evaluation: Evaluate nested pydantic models with type-specific metrics
  • Flexible Metric Registration: Register custom metrics for specific field types
  • Pandas Integration: Convert evaluation results to pandas DataFrames for easy analysis
  • Optional Integrations: Wrap scorers from other evaluation frameworks, such as autoevals and OpenAI's evals, as metrics
  • Comprehensive Type Support: Support for primitive types, lists, dictionaries, nested models, and more
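
To make "type-specific metrics on nested models" concrete, here is a minimal sketch of the idea using only the standard library. It is not the library's implementation: it works on plain dicts (for example, what Pydantic v2's `model_dump()` returns), and the `evaluate` function and `MetricSpec` registry are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry entry: (field types, metric name, callback).
# This only illustrates the concept; it is not structured-evals internals.
MetricSpec = Tuple[Tuple[type, ...], str, Callable[[Any, Any], float]]


def evaluate(gt: Any, pred: Any, specs: List[MetricSpec], path: str) -> Dict[str, Dict[str, float]]:
    """Walk two parallel structures and score every leaf whose type has a metric."""
    out: Dict[str, Dict[str, float]] = {}
    if isinstance(gt, dict) and isinstance(pred, dict):
        # Nested model: recurse into each field.
        for key in gt:
            out.update(evaluate(gt[key], pred.get(key), specs, f"{path}.{key}"))
    elif isinstance(gt, list) and isinstance(pred, list):
        # List field: compare items position by position.
        for i, item in enumerate(gt):
            other = pred[i] if i < len(pred) else None
            out.update(evaluate(item, other, specs, f"{path}[{i}]"))
    else:
        # Leaf: apply every metric registered for this exact type.
        # Using type() rather than isinstance() keeps bool separate from int.
        scores = {
            name: cb(gt, pred)
            for types, name, cb in specs
            if type(gt) in types and type(pred) in types
        }
        if scores:
            out[path] = scores
    return out


specs: List[MetricSpec] = [
    ((int, float, bool, str), "exact_match", lambda a, b: float(a == b)),
    ((int, float), "abs_diff", lambda a, b: float(abs(a - b))),
]

gt = {"name": "John Doe", "age": 30, "tags": ["a", "b"], "address": {"city": "Paris"}}
pred = {"name": "John Doe", "age": 32, "tags": ["a", "c"], "address": {"city": "Lyon"}}

results = evaluate(gt, pred, specs, path="Person")
for field, scores in results.items():
    print(field, scores)
```

Each leaf gets only the metrics registered for its type, so `age` receives both `exact_match` and `abs_diff`, while string fields receive only `exact_match`.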

📦 Installation

# Basic installation (requires Pydantic v2)
pip install structured-evals

# With optional dependencies
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "structured-evals[autoevals]"  # For autoevals integration
pip install "structured-evals[openai-evals]"  # For OpenAI evals integration
pip install "structured-evals[all]"  # For all optional dependencies
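
If your code should degrade gracefully when an optional extra is missing, one way is to check for the module before importing it. This is a small sketch using only the standard library; the helper name `is_installed` is an assumption for illustration, and `autoevals` is the module name used in the integration example in this README.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


if is_installed("autoevals"):
    print("autoevals is available; LLM-based metrics can be registered")
else:
    print('autoevals is missing; install it with: pip install "structured-evals[autoevals]"')
```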

🚀 Quickstart

from pydantic import BaseModel
from structured_evals.metrics import Evaluator, Metric

# Define your pydantic model
class Person(BaseModel):
    name: str
    age: int
    is_active: bool

# Create ground truth and prediction instances
x_gt = Person(name="John Doe", age=30, is_active=True)
x = Person(name="John Doe", age=32, is_active=True)

# Create evaluator and register metrics
evaluator = Evaluator()
evaluator.register(Person, (int, float, bool, str), Metric(name="exact_match", cb=lambda a, b: float(a == b)))
evaluator.register(Person, (int, float), Metric(name="abs_diff", cb=lambda a, b: abs(a - b)))

# Evaluate prediction against ground truth
results = evaluator(x_gt, x)

# View results as a dictionary
print(results)
# Example output: {'Person.name': {'exact_match': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
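
The Features list mentions pandas integration, but this README does not show the library's API for it. As a sketch under that caveat, the nested results dictionary above can be flattened by hand into one row per field and metric, which is a shape `pandas.DataFrame` accepts directly. The helper `results_to_rows` is hypothetical and not part of structured-evals.

```python
from typing import Dict, List


def results_to_rows(results: Dict[str, Dict[str, float]]) -> List[dict]:
    """Flatten {field: {metric: value}} into one record per (field, metric)."""
    rows = []
    for field, metrics in results.items():
        for metric, value in metrics.items():
            rows.append({"field": field, "metric": metric, "value": value})
    return rows


# The example output from the Quickstart, copied here so the sketch is self-contained.
results = {
    "Person.name": {"exact_match": 1.0},
    "Person.age": {"exact_match": 0.0, "abs_diff": 2.0},
    "Person.is_active": {"exact_match": 1.0},
}

rows = results_to_rows(results)
for row in rows:
    print(row)

# With pandas installed, the rows load directly into a DataFrame:
#   import pandas as pd
#   df = pd.DataFrame(rows)
#   table = df.pivot(index="field", columns="metric", values="value")
```

The long format keeps one score per row, which makes it straightforward to aggregate across many prediction and ground-truth pairs.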

🔧 Custom Metrics

You can register custom metrics for specific field types:

from structured_evals.metrics import Evaluator, Metric

# Define a custom metric using Jaccard similarity
def jaccard_similarity(text1: str, text2: str) -> float:
    """
    Calculate Jaccard similarity between two strings.
    
    Jaccard similarity is the size of the intersection divided by the size of the union of two sets.
    """
    if not text1 or not text2:
        return 0.0
    
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    
    return intersection / union if union > 0 else 0.0

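# Register the custom metric on the `evaluator` created in the Quickstart
# (`Person`, `x_gt`, and `x` are also reused from there)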
evaluator.register(
    Person,  # The model class
    str,     # The field type
    Metric(name="jaccard_similarity", cb=jaccard_similarity)
)

# Evaluate with the custom metric
results = evaluator(x_gt, x)

# View results as a dictionary
print(results)
# Example output: {'Person.name': {'exact_match': 1.0, 'jaccard_similarity': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
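
Because the two names in the example are identical, the Jaccard score there is trivially 1.0. The short sketch below shows how a token-level Jaccard score behaves on partial and zero overlap. `token_jaccard` is a compact stand-in written for this illustration; it follows the same definition (intersection size over union size of lowercased, whitespace-split tokens).

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    # An empty union means both strings had no tokens; treat that as no similarity.
    return len(set_a & set_b) / len(union) if union else 0.0


identical = token_jaccard("John Doe", "john doe")    # same tokens after lowercasing
partial = token_jaccard("John Doe", "John Smith")    # 1 shared token out of 3 distinct
disjoint = token_jaccard("John Doe", "Jane Smith")   # no shared tokens

print(identical, partial, disjoint)
```

Here "John Doe" and "John Smith" share one token ("john") out of three distinct tokens, so the score is 1/3.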

🔄 Integration with Other Frameworks

To integrate another evaluation framework, wrap its scorer in a custom metric:

from structured_evals.metrics import Evaluator, Metric

# Example integration with autoevals
# Note: This requires installing the optional dependencies with `pip install "structured-evals[autoevals]"`
from autoevals.llm import Factuality

# Create the autoevals metric
factuality = Factuality()

def factuality_metric(text1: str, text2: str) -> float:
    """Evaluate factuality using autoevals."""
    if not isinstance(text1, str) or not isinstance(text2, str):
        return 0.0
    result = factuality(text2, text1, input="Determine if the following two answers are consistent.")
    return result.score

# Register the metric on the `evaluator` created in the Quickstart
evaluator.register(
    Person,
    str,
    Metric(name="factuality", cb=factuality_metric)
)

# Evaluate with the integrated metrics
results = evaluator(x_gt, x)

# View results as a dictionary
print(results)
# Example output: {'Person.name': {'factuality': 0.95, 'exact_match': 1.0}, 'Person.age': {'exact_match': 0.0, 'abs_diff': 2.0}, 'Person.is_active': {'exact_match': 1.0}}
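
Metrics backed by an external service can fail or return no score, for example on a network error. One option is to wrap the callback so the evaluation still produces a number. The sketch below is an assumption about how you might handle this, not something the library provides; `with_fallback` and `flaky_scorer` are hypothetical names, and `flaky_scorer` stands in for a real LLM-backed callback.

```python
from typing import Any, Callable, Optional


def with_fallback(
    cb: Callable[[Any, Any], Optional[float]], default: float = 0.0
) -> Callable[[Any, Any], float]:
    """Wrap a metric callback so exceptions and None scores become `default`."""

    def wrapped(a: Any, b: Any) -> float:
        try:
            score = cb(a, b)
        except Exception:
            # The external scorer failed; fall back instead of aborting the run.
            return default
        return default if score is None else float(score)

    return wrapped


def flaky_scorer(a: str, b: str) -> Optional[float]:
    """Stand-in for an external scorer that sometimes fails or returns None."""
    if a == "boom":
        raise RuntimeError("simulated service error")
    if a == "":
        return None
    return 0.95


safe_scorer = with_fallback(flaky_scorer)
print(safe_scorer("John Doe", "John Doe"))  # normal score passes through
print(safe_scorer("boom", "John Doe"))      # exception becomes the default
print(safe_scorer("", "John Doe"))          # None becomes the default
```

The wrapped callable has the same two-argument shape as the callbacks above, so it could be passed as `Metric(name="factuality", cb=with_fallback(factuality_metric))`. Note that a silent default can hide outages, so logging the failure may be preferable in practice.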

📝 Writing Custom Evals

See docs/custom-eval.md for detailed instructions on writing custom evaluations.

📄 License

MIT
