
Simple and robust LLM evaluations

Write Once, Evaluate Anywhere

Why dotevals?

Like everyone else, we had to write evaluations. They needed to run with structured generation, use complex datasets, run at scale, and make failure modes easy to explore. We looked around but couldn't find what we needed, so dotevals was born.

  • No complex YAML or DSLs, just familiar Python - Write evaluations as functions.
  • Works in notebooks - Seamless notebook integration for interactive development and rapid prototyping.
  • Works with pytest - Integrate with CI/CD, use parametrization and fixtures.
  • Automatic Resumption - Evaluations crash. dotevals picks them up where they left off.
  • Extensible by Design - Plugin architecture for any dataset, evaluator, storage or LLM.
  • Effortless Scaling - Run dozens of experiments in parallel without changing the code.

The dotevals philosophy

Evaluations are just functions over data. Write a single function, and dotevals handles running it at scale with:

  • Failure recovery.
  • Automatic and configurable concurrency.
  • Result persistence.
  • Resource management using pytest fixtures.

Focus on what to evaluate, not how to run evaluations.
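
To make "failure recovery" and "result persistence" concrete, here is a minimal sketch of a resumable runner in plain Python. It is not how dotevals is implemented; `run_resumable`, the JSON checkpoint layout, and `eval_length` are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(eval_fn, dataset, checkpoint_path):
    """Apply eval_fn to every row, skipping rows already recorded on disk."""
    path = Path(checkpoint_path)
    # Load results from a previous (possibly interrupted) run, if any.
    done = json.loads(path.read_text()) if path.exists() else {}
    for index, row in enumerate(dataset):
        key = str(index)  # JSON object keys are always strings
        if key in done:
            continue
        try:
            done[key] = {"score": eval_fn(*row), "error": None}
        except Exception as exc:  # record the failure and keep going
            done[key] = {"score": None, "error": repr(exc)}
        # Persist after each row so a crash loses at most one item.
        path.write_text(json.dumps(done))
    return done


def eval_length(prompt, expected_length):
    return len(prompt) == expected_length


dataset = [("abc", 3), ("hello", 5), ("xy", 3)]
checkpoint = Path(tempfile.mkdtemp()) / "results.json"
results = run_resumable(eval_length, dataset, checkpoint)
accuracy = sum(r["score"] is True for r in results.values()) / len(results)
```

Calling `run_resumable` a second time with the same checkpoint file does no new work, which is the behaviour you want after a crash.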

Extensible by Design

dotevals is built with a plugin architecture that lets you extend every component:

🔌 Use Any LLM

dotevals integrates seamlessly with any LLM client, whether it's OpenAI, Anthropic, HuggingFace, or your own custom model. You can pass your model client to your evaluation function via a pytest fixture.

import pytest
from dotevals import foreach
from dotevals.evaluators import exact_match

# Example dataset (replace with your actual dataset)
dataset = [
    ("Hello world", "hello world"),
    ("The quick brown fox", "quick brown fox"),
]

@pytest.fixture
def my_openai_model():
    """Your OpenAI model client as a pytest fixture."""
    import openai
    return openai.AsyncClient()

@foreach("prompt,expected", dataset)
async def eval_with_openai(prompt, expected, my_openai_model):
    response = await my_openai_model.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return exact_match(response.choices[0].message.content, expected)

You can just as easily use transformers models. Here's an example using a transformers model with outlines for structured generation (install outlines with pip install outlines):

import pytest
from dotevals import batch, Result
from dotevals.evaluators import exact_match

@pytest.fixture
def my_transformers_model():
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import outlines

    hf_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

    return outlines.from_transformers(hf_model, hf_tokenizer)


@batch("prompt,expected", dataset, batch_size=8) # batch_size is used by the @batch decorator
async def eval_with_transformers(prompt, expected, my_transformers_model):
    response = await my_transformers_model(prompt)
    return exact_match(response, expected)
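
As a rough picture of what batching involves, the sketch below groups dataset rows into chunks and transposes each chunk into per-column lists. Whether dotevals passes lists to the decorated function in exactly this shape is an assumption; `iter_batches` is a hypothetical helper, not part of the library.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield one tuple of column-lists per batch of rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Transpose [(p1, e1), (p2, e2)] into ([p1, p2], [e1, e2]).
        yield tuple(list(column) for column in zip(*chunk))


rows = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
batches = list(iter_batches(rows, batch_size=2))
```

The last batch may be smaller than `batch_size`, so batch-aware evaluation code should not assume a fixed length.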

💾 Store Anywhere

dotevals automatically persists your evaluation results. By default, results are stored in local JSON files, but you can easily configure different storage backends.

  • JSON files (default): Stored in a local .dotevals directory.

    pytest eval.py
    
  • SQLite: For a lightweight, queryable database. (Install with pip install dotevals-storage-sqlite)

    pytest eval.py --storage sqlite://results.db
    
  • S3: For cloud storage of your results. (Install with pip install dotevals-s3)

    pytest --experiment experiment_name --storage s3://your-bucket/path
    

No need to change your evaluation code – just specify the storage backend when you run your evaluations.
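
One reason to prefer a queryable backend is that aggregate questions become a single SQL statement. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `results` table; the schema that the SQLite plugin actually writes is not documented here, so treat the table and column names as assumptions.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # use a file path to persist
connection.execute(
    "CREATE TABLE results (experiment TEXT, item_id INTEGER, score REAL, error TEXT)"
)
rows = [
    ("baseline", 0, 1.0, None),
    ("baseline", 1, 0.0, None),
    ("baseline", 2, None, "TimeoutError"),
]
connection.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# Accuracy over items that completed; AVG ignores NULL scores.
accuracy, failures = connection.execute(
    "SELECT AVG(score), SUM(error IS NOT NULL) FROM results WHERE experiment = ?",
    ("baseline",),
).fetchone()
```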

🚀 Run Anywhere

dotevals allows you to run your evaluations in various environments, from local development to distributed cloud deployments. This is achieved through Executors, which define how your evaluation functions are executed.

  • Local Execution (default): Evaluations run sequentially on your local machine.

    @foreach("input,output", dataset)
    def eval_local(input, output, model):
        return exact_match(model(input), output)
    
  • Distributed Execution with Modal: Run your evaluations at scale on Modal, a cloud platform for running Python code. The dotevals-modal plugin provides an executor that handles the distributed execution. (Install with pip install dotevals-modal)

    @foreach("question,answer", dataset)
    async def eval_distributed(question, answer, modal_client):
        # modal_client is provided by the dotevals-modal plugin
        response = await modal_client.generate(question)
        return exact_match(response, answer)
    

    Run with:

    pytest eval.py --executor modal
    

Executors abstract away the execution environment, allowing you to write your evaluation logic once and run it anywhere.
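
The idea behind executors can be sketched without any dotevals API: the same evaluation function is handed to different runners, and only the runner decides how rows are scheduled. `run_sequential` and `run_threaded` below are hypothetical stand-ins, not the library's executors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(eval_fn, dataset):
    """Evaluate rows one after another."""
    return [eval_fn(*row) for row in dataset]


def run_threaded(eval_fn, dataset, max_workers=4):
    """Evaluate rows on a thread pool; map() preserves dataset order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda row: eval_fn(*row), dataset))


def eval_upper(text, expected):
    return text.upper() == expected


dataset = [("a", "A"), ("b", "B"), ("c", "x")]
sequential = run_sequential(eval_upper, dataset)
threaded = run_threaded(eval_upper, dataset)
```

Because `eval_upper` knows nothing about scheduling, swapping one runner for the other requires no change to the evaluation logic.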

📊 Evaluate Anything

dotevals provides a flexible evaluation system that allows you to define custom evaluation logic. You can use built-in evaluators or create your own.

  • Built-in Evaluators: Ready-to-use evaluators for common tasks.

    from dotevals.evaluators import (
        exact_match,      # String equality
        numeric_match,    # Numeric comparison
        valid_json,       # JSON validation
        ast_evaluation,   # Function call validation
    )
    
  • Custom Evaluator Functions: Easily create your own evaluation logic.

    from dotevals.evaluators import evaluator
    from dotevals.metrics import accuracy
    
    @evaluator(metrics=accuracy())
    def domain_specific_match(response, expected):
        # Your evaluation logic here
        return your_validation(response, expected)
    
  • LLM-based Evaluators: Leverage LLMs to judge model outputs. (Install with pip install dotevals-evaluators-llm)

    from dotevals_evaluators_llm.evaluators import (
        llm_judge,           # LLM-based evaluation
        semantic_similarity, # Embedding similarity
        factual_consistency, # Fact checking
    )
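
The `your_validation` placeholder in the custom evaluator above could be anything that returns a pass/fail value. As one possible sketch (an assumption, not a dotevals built-in), here is a comparison that ignores case, ASCII punctuation, and extra whitespace:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def your_validation(response: str, expected: str) -> bool:
    """Return True when the two strings match after normalization."""
    return normalize(response) == normalize(expected)
```

For example, `your_validation("  Hello,  World! ", "hello world")` passes, while a strict string comparison would not.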
    

Quick Start

Getting started with dotevals is simple:

1. Install dotevals

pip install dotevals  # Core functionality with basic evaluators
pip install dotevals-datasets  # Common benchmark datasets (GSM8K, MMLU, etc.)

2. Create a model

Your model can be any Python object that can generate a response. For example, you can use OpenAI's client directly:

import openai

client = openai.OpenAI()

def my_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

If you use libraries like outlines for structured generation, you can integrate them here as well.

3. Write the evaluation function

from dotevals import foreach
from dotevals.evaluators import numeric_match

dataset = [
    ("What is 2+2?", "4"),
    ("How many days are there in a week?", "7")
]

@foreach("question,answer", dataset)
def eval_math(question, answer):
    response = my_model(question)
    return numeric_match(response, answer)

4. Run interactively (notebooks/scripts)

from dotevals import run

# Run evaluation and get immediate results
results = run(eval_math)

# View summary
print(results.summary())
# {'total': 2, 'errors': 0, 'metrics': {'numeric_match': {'accuracy': 1.0}}}

5. Run with pytest (CI/CD)

pytest eval_math.py --experiment my_evaluation
dotevals show my_evaluation  # View results

Examples

Here are some examples that show how dotevals solves common problems:

# Helper function to extract answer from model response
def extract_answer(response: str) -> str:
    # Implement your logic to extract the answer from the model's raw response
    # This is a placeholder and needs to be adapted to your specific model's output format.
    return response.strip()
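
For math benchmarks, one common heuristic is to take the last number that appears in the response. The sketch below is only one option and assumes the model states its final answer last; `extract_last_number` is a hypothetical helper, not part of dotevals.

```python
import re

# An optional minus sign, digits with optional thousands separators,
# and an optional decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(response: str) -> str:
    """Return the last number in the text without thousands separators."""
    matches = NUMBER.findall(response)
    if not matches:
        return ""
    return matches[-1].replace(",", "")
```

The heuristic fails when the model keeps talking after the answer (for example, "The answer is 18, as computed in step 2"), so a stricter output format or structured generation is safer when you control the prompt.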

# Example dataset (replace with your actual dataset)
dataset = [
    ("What is 2+2?", "4"),
    ("What color is the sky?", "blue"),
]

🧮 Evaluate GPT-5 on GSM8K

import pytest
from dotevals import foreach, Result
from dotevals.evaluators.base import numeric_match

# Assuming you have an OpenAI client configured
class GPT5Model:
    def __init__(self):
        import openai
        self.client = openai.OpenAI()

    def __call__(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

@pytest.fixture()
def gpt5():
    return GPT5Model()

@foreach.gsm8k("test")
def eval_gsm8k(question, reasoning, answer, gpt5):
    response = gpt5(question)
    extracted_answer = extract_answer(response)

    return numeric_match(extracted_answer, answer)

📊 Compare GPT-5 with Opus-4.1 on GSM8K

import pytest
from dotevals import foreach, Result
from dotevals.evaluators.base import numeric_match

# Assuming you have OpenAI and Anthropic clients configured
class OpenAIModel:
    def __init__(self, model_name: str):
        import openai
        self.client = openai.OpenAI()
        self.model_name = model_name

    def __call__(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

class AnthropicModel:
    def __init__(self, model_name: str):
        import anthropic
        self.client = anthropic.Anthropic()
        self.model_name = model_name

    def __call__(self, prompt: str) -> str:
        response = self.client.messages.create(
            model=self.model_name,
            max_tokens=1024,  # required by the Anthropic Messages API
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

@pytest.fixture(params=["openai:gpt-5", "anthropic:claude-opus-4-1"])
def models(request):
    provider_name, model_name = request.param.split(":")

    if provider_name == "openai":
        return OpenAIModel(model_name)
    elif provider_name == "anthropic":
        return AnthropicModel(model_name)
    else:
        raise ValueError(f"Model {model_name} for {provider_name} is not available")

@foreach.gsm8k("test")
def eval_gsm8k(question, reasoning, answer, models):
    # Assuming extract_answer is a helper function you define
    response = models(question)
    extracted_answer = extract_answer(response)

    return numeric_match(extracted_answer, answer)

🏗️ Evaluate Phi-3.5 with structured outputs on BFCL simple

import pytest
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dotevals import foreach, Result
from dotevals.evaluators.base import numeric_match

class Phi3Model:
    def __init__(self):
        model_name = "microsoft/Phi-3-mini-4k-instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

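    def __call__(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(**inputs, max_new_tokens=50)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)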

@pytest.fixture()
def phi():
    return Phi3Model()

@foreach.bfcl("simple")
def eval_bfcl(question, schema, answer, phi):
    # Assuming extract_answer is a helper function you define;
    # `answer` is assumed to be a column provided by the dataset
    response = phi(question)
    extracted_answer = extract_answer(response)

    return numeric_match(extracted_answer, answer)

⚡ Maximize throughput on a vLLM instance

Start by installing the dotevals-vllm plugin:

pip install dotevals-vllm

Then, use the vllm_client fixture provided by the plugin:

import pytest
from dotevals import foreach, Result
from dotevals.evaluators.base import numeric_match
from dotevals.concurrency import Concurrency, adaptive

@pytest.fixture()
def vllm_model(vllm_client):
    """
    The `dotevals-vllm` plugin provides a `vllm_client` fixture that allows you to spin up, use, and shut down a vLLM instance locally or remotely.

    We wrap it with an adaptive concurrency strategy to maximize throughput.
    """
    # Use adaptive concurrency for self-hosted models to maximize throughput
    concurrency = Concurrency(adaptive(initial=20, max=100))
    return concurrency.wrap(vllm_client)

@foreach.bfcl("simple")
async def eval_vllm(question, schema, answer, vllm_model):
    # Assuming extract_answer is a helper function you define;
    # `answer` is assumed to be a column provided by the dataset
    response = await vllm_model.generate(question)
    extracted_answer = extract_answer(response)

    return numeric_match(extracted_answer, answer)
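
To give a feel for what an adaptive strategy does, here is a small additive-increase / multiplicative-decrease limiter built on `asyncio`. It is a sketch of the general technique, not the implementation behind `Concurrency` or `adaptive`; `AdaptiveLimiter` and `fake_generate` are hypothetical names.

```python
import asyncio


class AdaptiveLimiter:
    """Additive-increase / multiplicative-decrease cap on in-flight calls."""

    def __init__(self, initial=20, maximum=100):
        self.limit = initial
        self.maximum = maximum
        self.in_flight = 0
        self._changed = asyncio.Condition()

    async def run(self, make_call):
        # Wait until there is room under the current limit.
        async with self._changed:
            await self._changed.wait_for(lambda: self.in_flight < self.limit)
            self.in_flight += 1
        try:
            result = await make_call()
        except Exception:
            self.limit = max(1, self.limit // 2)  # back off after a failure
            raise
        else:
            self.limit = min(self.maximum, self.limit + 1)  # probe for headroom
            return result
        finally:
            async with self._changed:
                self.in_flight -= 1
                self._changed.notify_all()


async def demo():
    limiter = AdaptiveLimiter(initial=2, maximum=4)

    async def fake_generate(i):
        await asyncio.sleep(0)  # stand-in for a network call
        return i * 2

    tasks = [limiter.run(lambda i=i: fake_generate(i)) for i in range(6)]
    return await asyncio.gather(*tasks), limiter.limit


outputs, final_limit = asyncio.run(demo())
```

The limit grows by one after each success and halves after each failure, so it settles near the highest request rate the server tolerates.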

📦 Store results in S3

You don't need to change your experiment's implementation. Install the S3 plugin and run the experiment with the --storage option set to an s3:// URL. The S3 plugin also provides other options to configure the storage.

(Install with pip install dotevals-s3)

pytest --experiment experiment_name --storage s3://your-bucket/path

🧑‍⚖️ Use LLM-as-a-judge evaluators

dotevals supports LLM-as-a-judge evaluators through the dotevals-evaluators-llm plugin. This allows you to use a large language model to evaluate the output of another model.

(Install with pip install dotevals-evaluators-llm)

from dotevals import foreach, Result
from dotevals_evaluators_llm.evaluators import llm_judge

@foreach("prompt,expected", dataset)
def eval_with_llm_judge(prompt, expected, llm_judge_model):
    # llm_judge_model is a fixture that provides an LLM for judging
    response = llm_judge_model.generate(prompt)
    score = llm_judge(response, expected)
    return Result(score)
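
Under the hood, an LLM judge usually amounts to a grading prompt plus a parser for the reply. The sketch below shows that pattern with a stub in place of a real model so it runs offline. The prompt wording, the 1-5 scale, and the names `judge_score` and `stub_judge` are assumptions for illustration, not the plugin's behaviour.

```python
import re

JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Reference: {expected}\n"
    "Answer: {response}\n"
    "Reply with a single integer score from 1 to 5."
)


def judge_score(response, expected, judge, pass_mark=4):
    """Ask `judge` (any callable str -> str) for a 1-5 score and threshold it."""
    reply = judge(JUDGE_TEMPLATE.format(expected=expected, response=response))
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from {reply!r}")
    score = int(match.group())
    return score, score >= pass_mark


def stub_judge(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return "Score: 5" if "Answer: 4" in prompt else "Score: 2"


good = judge_score("4", "4", stub_judge)
bad = judge_score("5", "4", stub_judge)
```

Raising on an unparseable reply, rather than silently scoring it as a failure, keeps judge errors separate from genuine model errors.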

Extensible by Design

dotevals is built with a plugin architecture that lets you extend every component:

🔌 Use Any LLM

# Models are provided via pytest fixtures
@pytest.fixture
def model():
    """Your model as a pytest fixture."""
    return load_your_model()  # OpenAI, Anthropic, HuggingFace, etc.

@foreach("prompt,expected", dataset)
def eval_with_model(prompt, expected, model):
    response = model.generate(prompt)
    return exact_match(response, expected)

# For Modal deployment (pip install dotevals-modal)
# The vllm_client fixture is automatically provided
@foreach("prompt,expected", dataset)
async def eval_modal(prompt, expected, vllm_client):
    response = await vllm_client.agenerate(prompt)
    return exact_match(response, expected)

💾 Store Anywhere

# JSON files (default)
pytest eval.py --storage json://.dotevals

# SQLite with SQL queries
pytest eval.py --storage sqlite://results.db

# Your custom backend
pytest eval.py --storage s3://bucket/path

🚀 Run Anywhere

# Local execution (default)
@foreach("input,output", dataset)
def eval_local(input, output, model):
    return exact_match(model(input), output)

# Distributed on Modal (pip install dotevals-modal)
@foreach("question,answer", dataset)
async def eval_distributed(question, answer, vllm_client):
    # vllm_client automatically injected by Modal runner
    response = await vllm_client.agenerate(question)
    return exact_match(response, answer)

# Run with: pytest eval.py --runner modal --modal-model meta-llama/Llama-3-8b

📊 Evaluate Anything

from dotevals.evaluators import evaluator
from dotevals.metrics import accuracy

# Built-in evaluators
from dotevals.evaluators import (
    exact_match,      # String equality
    numeric_match,    # Numeric comparison
    valid_json,       # JSON validation
    ast_evaluation,   # Function call validation
)

# Create custom evaluators in 4 lines
@evaluator(metrics=accuracy())
def domain_specific_match(response, expected):
    # Your evaluation logic
    return your_validation(response, expected)

# LLM-based evaluators (pip install dotevals-evaluators-llm)
from dotevals_evaluators_llm.evaluators import (
    llm_judge,           # LLM-based evaluation
    semantic_similarity, # Embedding similarity
    factual_consistency, # Fact checking
)

🔧 Execute however you want

When @foreach and @batch aren't enough, create your own execution strategy:

# Custom executor for async batch APIs (e.g., OpenAI Batch API)
@async_batch("question", dataset, model=gpt4_batch)
def eval_reasoning(question: list[str]) -> list[Result]:
    responses = model.generate(question)
    return [judge_reasoning(r) for r in responses]

# Returns immediately, processes in background
handle = eval_reasoning(session_manager)
results = handle.wait()  # Get results when ready

Build executors for:

  • Async APIs: Submit jobs and poll for results
  • Streaming endpoints: Process data as it arrives
  • Custom infrastructure: GPU batching, distributed workers
  • Special workflows: Checkpointing, caching, fallback strategies

Switch execution without changing evaluation logic - debug with @foreach, scale with @async_batch.
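
The "submit jobs and poll for results" pattern can be sketched with the standard library alone. `BatchHandle` and `submit_batch` are hypothetical names that mirror the `handle.wait()` call shown above; a real executor for a remote batch API would poll an HTTP endpoint rather than a local future.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class BatchHandle:
    """Wraps a background job; wait() polls until the job has finished."""

    def __init__(self, future, poll_interval=0.01):
        self._future = future
        self._poll_interval = poll_interval

    def done(self):
        return self._future.done()

    def wait(self, timeout=5.0):
        deadline = time.monotonic() + timeout
        while not self.done():
            if time.monotonic() > deadline:
                raise TimeoutError("batch job did not finish in time")
            time.sleep(self._poll_interval)
        return self._future.result()


_pool = ThreadPoolExecutor(max_workers=1)


def submit_batch(eval_fn, questions):
    """Return immediately with a handle while eval_fn runs in the background."""
    return BatchHandle(_pool.submit(eval_fn, questions))


def eval_lengths(questions):
    return [len(q) > 3 for q in questions]


handle = submit_batch(eval_lengths, ["hi", "hello", "what?"])
results = handle.wait()
_pool.shutdown()
```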

About .txt

Dotevals is developed and maintained by .txt, a company dedicated to making LLMs more reliable for production applications.

Our focus is on advancing structured generation technology through:

  • 🧪 Cutting-edge research: We publish our findings on structured generation.
  • 🚀 Enterprise-grade solutions: You can license our enterprise-grade libraries.
  • 🧩 Open source collaboration: We believe in building in public and contributing to the community.

Follow us on Twitter or check out our blog to stay updated on our latest work in making LLMs more reliable.

Contributing

We welcome contributions! See our Contributing Guide for details.

License

MIT License - see LICENSE for details.
