dotevals

Simple and robust LLM evaluations

Write Once, Evaluate Anywhere

Why dotevals?

Just like everyone, we had to write evaluations. They needed to run with structured generation, use complex datasets, run at scale, and allow for easy exploration of failure modes. We looked around, but couldn't find what we needed. So dotevals was born.

No YAML, just Python - Write evaluations as functions.
Works with pytest - Integrate with CI/CD, use parametrization and fixtures.
Resumable - Evaluations crash. dotevals picks them up where they left off.
Extensible - Plugin architecture for any dataset, evaluator, storage or LLM.
Scale when needed - Run dozens of experiments in parallel without changing the code.

The dotevals philosophy

Evaluations are just functions over data. Write a single function, and dotevals handles running it at scale with:

Failure recovery.
Automatic and configurable concurrency.
Result persistence.
Resource management using pytest fixtures.

Focus on what to evaluate, not how to run evaluations.
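
To make the idea of failure recovery and result persistence concrete, here is a minimal sketch of a resumable loop over a dataset. It is not how dotevals is implemented; `run_resumable`, the JSON checkpoint layout, and the `is_even` toy evaluation are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(eval_fn, dataset, checkpoint: Path):
    """Apply eval_fn to every row, skipping rows already recorded in the checkpoint."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for index, row in enumerate(dataset):
        key = str(index)  # JSON object keys are always strings
        if key in done:
            continue  # this row was finished in an earlier run
        try:
            done[key] = {"score": eval_fn(*row), "error": None}
        except Exception as exc:  # record the failure instead of aborting the run
            done[key] = {"score": None, "error": repr(exc)}
        checkpoint.write_text(json.dumps(done))  # persist after every row
    return done


def is_even(number, expected):
    """Toy evaluation: does the parity of number match the expected flag?"""
    return (number % 2 == 0) == expected


rows = [(2, True), (3, False), (4, False)]
path = Path(tempfile.mkdtemp()) / "checkpoint.json"
first = run_resumable(is_even, rows[:2], path)  # simulate a run that stopped early
second = run_resumable(is_even, rows, path)     # resumes: only the third row is evaluated
```

Because the checkpoint is rewritten after every row, a crash loses at most the row that was in flight. In this sketch rows that raised are recorded and not retried on resume.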

Quick Start

Getting started with dotevals is simple:

1. Install dotevals

pip install dotevals  # Core functionality with basic evaluators
pip install dotevals-datasets-common  # Common benchmark datasets (GSM8K, MMLU, etc.)

2. Create a model

import outlines
import openai

# A synchronous client, so the model can be called directly from a regular function
model = outlines.from_openai(openai.OpenAI(), "gpt-4o")

3. Write the evaluation function

from dotevals import foreach
from dotevals.evaluators import numeric_match

dataset = [
    ("What is 2+2?", "4"),
    ("How many days are there in a week?", "7")
]

@foreach("question,answer", dataset)
def eval_math(question, answer):
    response = model(question)
    return numeric_match(response, answer)
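
For intuition about why a numeric comparison is used here rather than string equality, the sketch below shows one way such a check could work. It is an assumption for illustration only and not the implementation of `numeric_match`; `to_number` and `numbers_equal` are hypothetical names.

```python
from decimal import Decimal, InvalidOperation


def to_number(text):
    """Parse a value such as '1,000' or ' 4.0 ' into a Decimal, or return None."""
    try:
        return Decimal(str(text).replace(",", "").strip())
    except InvalidOperation:
        return None


def numbers_equal(response, expected):
    """True when both values parse as numbers and are numerically equal."""
    left, right = to_number(response), to_number(expected)
    return left is not None and right is not None and left == right
```

With this kind of comparison, a response of "4.0" counts as matching an expected "4", which plain string equality would reject.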

4. Run interactively (notebooks/scripts)

from dotevals import run

# Run evaluation and get immediate results
results = run(eval_math)

# View summary
print(results.summary())
# {'total': 2, 'errors': 0, 'metrics': {'numeric_match': {'accuracy': 1.0}}}

5. Run with pytest (CI/CD)

pytest eval_math.py --experiment my_evaluation
dotevals show my_evaluation  # View results

Examples

Here are some examples that show how dotevals solves common problems:

🧮 Evaluate GPT-5 on GSM8K

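import outlines
import pytest
from openai import AsyncOpenAI

from dotevals import foreach
from dotevals.evaluators import numeric_match

@pytest.fixture()
def gpt5():
    return outlines.from_openai(AsyncOpenAI(), "gpt-5")


@foreach.gsm8k("test")
async def eval_gsm8k(question, reasoning, answer, gpt5):
    # The client is asynchronous, so the model call must be awaited
    result = await gpt5(question)
    # extract_answer is a helper you supply to pull the final answer out of the completion
    result = extract_answer(result)

    return numeric_match(result, answer)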
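
The examples in this section call `extract_answer` without defining it. A minimal sketch of such a helper follows, assuming the completion either ends with a GSM8K-style `#### <number>` marker or mentions the final answer as its last number. The regular expressions and the fallback rule are illustrative assumptions, not part of dotevals.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_answer(completion: str) -> Optional[str]:
    """Return the final number in a completion, preferring a '#### <number>' marker."""
    marked = re.search(r"####\s*(" + NUMBER + ")", completion)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(NUMBER, completion)
    # Fall back to the last number mentioned, or None when there is none
    return numbers[-1].replace(",", "") if numbers else None
```

Returning `None` when nothing is found lets the evaluator score the sample as a miss instead of raising.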

📊 Compare GPT-5 with Opus-4.1 on GSM8K

import outlines
import pytest

from dotevals import foreach
from dotevals.evaluators import numeric_match

@pytest.fixture()
def model(request):
    # request.param is the "provider:model" string supplied by parametrize below
    provider_name, model_name = request.param.split(":")

    if provider_name == "openai":
        from openai import AsyncOpenAI
        return outlines.from_openai(AsyncOpenAI(), model_name)
    elif provider_name == "anthropic":
        from anthropic import AsyncAnthropic
        return outlines.from_anthropic(AsyncAnthropic(), model_name)
    else:
        raise ValueError(f"Model {model_name} for {provider_name} is not available")


# indirect=True routes each value to the fixture with the same name
@pytest.mark.parametrize("model", ["openai:gpt-5", "anthropic:opus-4.1"], indirect=True)
@foreach.gsm8k("test")
async def eval_gsm8k(question, reasoning, answer, model):
    result = await model(question)
    result = extract_answer(result)  # helper you supply

    return numeric_match(result, answer)

🏗️ Evaluate Phi-3.5 with structured outputs on BFCL simple

import outlines
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer

from dotevals import foreach
from dotevals.evaluators import numeric_match

@pytest.fixture()
def phi():
    tf_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    tf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    return outlines.from_transformers(tf_model, tf_tokenizer)


@foreach.bfcl("simple")
def eval_bfcl(question, schema, answer, phi):
    # The column names (question, schema, answer) are assumed; match them to the BFCL dataset plugin
    result = phi(question, schema)
    result = extract_answer(result)  # helper you supply

    return numeric_match(result, answer)

⚡ Maximize throughput on a vLLM instance

Start by installing the vLLM plugin:

pip install dotevals-vllm

import pytest

from dotevals.evaluators import numeric_match

@pytest.fixture()
def phi(vllm):
    """
    The vLLM plugin provides a fixture that lets you spin up, use, and shut down a local vLLM instance.
    """
    handle = vllm.setup("Phi-3.5-mini-instruct")
    yield handle.client
    handle.shutdown()


# ForEach and Adaptive come from dotevals; import them from the modules given in the dotevals documentation
foreach = ForEach(concurrency=Adaptive())

@foreach.bfcl("simple")
def eval_bfcl(question, schema, answer, phi):
    # The column names (question, schema, answer) are assumed; match them to the BFCL dataset plugin
    result = phi(question, schema)
    result = extract_answer(result)  # helper you supply

    return numeric_match(result, answer)
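
The README does not describe what `Adaptive()` does internally. As general background, one common way to adapt concurrency to a server's capacity is additive increase, multiplicative decrease (AIMD): grow the limit slowly while requests succeed and cut it sharply when one fails. The sketch below illustrates that generic technique only; `AimdLimit`, `run_all`, and `double` are hypothetical names, and nothing here is claimed to match the plugin's behavior.

```python
import asyncio


class AimdLimit:
    """Additive-increase, multiplicative-decrease concurrency limit."""

    def __init__(self, start=4, minimum=1, maximum=64):
        self.value, self.minimum, self.maximum = start, minimum, maximum

    def on_success(self):
        self.value = min(self.maximum, self.value + 1)

    def on_failure(self):
        self.value = max(self.minimum, self.value // 2)


async def run_all(task, items, limit):
    """Run task over items in waves whose size follows the current limit."""
    results, position = [], 0
    while position < len(items):
        wave = items[position:position + limit.value]
        outcomes = await asyncio.gather(*(task(item) for item in wave), return_exceptions=True)
        if any(isinstance(outcome, Exception) for outcome in outcomes):
            limit.on_failure()  # back off quickly when the backend struggles
        else:
            limit.on_success()  # probe for more capacity one slot at a time
        results.extend(outcomes)  # failed items stay in the results as exceptions
        position += len(wave)
    return results


async def double(x):
    await asyncio.sleep(0)  # stand-in for a network call
    return 2 * x


limit = AimdLimit(start=2)
doubled = asyncio.run(run_all(double, list(range(5)), limit))
```

This sketch does not retry failed items; a real runner would typically requeue them once the limit has dropped.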

📦 Store results in S3

You don't need to change your experiment's implementation: install the S3 plugin and run the experiment with the storage option set to s3. The S3 plugin also provides further options to configure the storage.

pip install dotevals-s3

pytest --experiment experiment_name --storage s3 --bucket xxx

🧑‍⚖️ Use LLM-as-a-judge evaluators

You can define custom evaluators. We encourage you to develop a plugin so that using your evaluator is as easy as:

pip install dotevals-llm-as-a-judge

Extensible by Design

dotevals is built with a plugin architecture that lets you extend every component:

🔌 Use Any LLM

import pytest

from dotevals import foreach
from dotevals.evaluators import exact_match

# Models are provided via pytest fixtures
@pytest.fixture
def model():
    """Your model as a pytest fixture."""
    return load_your_model()  # OpenAI, Anthropic, HuggingFace, etc.

@foreach("prompt,expected", dataset)
def eval_with_model(prompt, expected, model):
    response = model.generate(prompt)
    return exact_match(response, expected)

# For Modal deployment (pip install dotevals-modal)
# The vllm_client fixture is automatically provided
@foreach("prompt,expected", dataset)
async def eval_modal(prompt, expected, vllm_client):
    response = await vllm_client.agenerate(prompt)
    return exact_match(response, expected)

💾 Store Anywhere

# JSON files (default)
pytest eval.py --storage json://.dotevals

# SQLite with SQL queries
pytest eval.py --storage sqlite://results.db

# Your custom backend
pytest eval.py --storage s3://bucket/path

🚀 Run Anywhere

# Local execution (default)
@foreach("prompt,expected", dataset)
def eval_local(prompt, expected, model):
    return exact_match(model(prompt), expected)

# Distributed on Modal (pip install dotevals-modal)
@foreach("question,answer", dataset)
async def eval_distributed(question, answer, vllm_client):
    # vllm_client automatically injected by Modal runner
    response = await vllm_client.agenerate(question)
    return exact_match(response, answer)

# Run with: pytest eval.py --runner modal --modal-model meta-llama/Llama-3-8b

📊 Evaluate Anything

from dotevals.evaluators import evaluator
from dotevals.metrics import accuracy

# Built-in evaluators
from dotevals.evaluators import (
    exact_match,      # String equality
    numeric_match,    # Numeric comparison
    valid_json,       # JSON validation
    ast_evaluation,   # Function call validation
)

# Create custom evaluators in 4 lines
@evaluator(metrics=accuracy())
def domain_specific_match(response, expected):
    # Your evaluation logic
    return your_validation(response, expected)

# LLM-based evaluators (pip install dotevals-evaluators-llm)
from dotevals.evaluators import (
    llm_judge,           # LLM-based evaluation
    semantic_similarity, # Embedding similarity
    factual_consistency, # Fact checking
)
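
As an example of what could stand in for the `your_validation` placeholder above, here is a small sketch of domain-specific matching logic that ignores case, punctuation, and extra whitespace. `normalize` and `normalized_match` are hypothetical names; the functions are plain Python, so they can be tested on their own before being wrapped with the `@evaluator` decorator shown above.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def normalized_match(response: str, expected: str) -> bool:
    """True when the two strings are equal after normalization."""
    return normalize(response) == normalize(expected)
```

Keeping the comparison in a plain function makes it easy to unit test independently of any model call.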

🔧 Execute however you want

When @foreach and @batch aren't enough, create your own execution strategy:

# Custom executor for async batch APIs (e.g., OpenAI Batch API)
@async_batch("question", dataset, model=gpt4_batch)
def eval_reasoning(question: list[str]) -> list[Result]:
    responses = model.generate(question)
    return [judge_reasoning(r) for r in responses]

# Returns immediately, processes in background
handle = eval_reasoning(session_manager)
results = handle.wait()  # Get results when ready

Build executors for:

Async APIs: Submit jobs and poll for results
Streaming endpoints: Process data as it arrives
Custom infrastructure: GPU batching, distributed workers
Special workflows: Checkpointing, caching, fallback strategies

Switch execution without changing evaluation logic - debug with @foreach, scale with @async_batch.
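
To illustrate the "submit jobs, then collect results" pattern that a custom executor might follow, here is a minimal sketch built on the standard library's thread pool. It is not the dotevals executor interface; `BatchHandle`, `submit_in_batches`, and `lengths` are hypothetical names, and a real batch API would poll a remote job rather than a local future.

```python
from concurrent.futures import ThreadPoolExecutor


class BatchHandle:
    """Holds the futures of submitted batches; wait() blocks and flattens the results."""

    def __init__(self, futures):
        self._futures = futures

    def wait(self):
        # Iterating the futures in submission order keeps results aligned with inputs
        return [item for future in self._futures for item in future.result()]


def submit_in_batches(batch_fn, items, batch_size, pool):
    """Submit batch_fn over fixed-size slices of items and return immediately."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    return BatchHandle([pool.submit(batch_fn, batch) for batch in batches])


def lengths(questions):
    """Stand-in for a batched model call: one output per input."""
    return [len(question) for question in questions]


with ThreadPoolExecutor(max_workers=2) as pool:
    handle = submit_in_batches(lengths, ["a", "bb", "ccc"], batch_size=2, pool=pool)
    results = handle.wait()
```

The evaluation function only sees a list of inputs and returns a list of outputs, so the same logic can run one item at a time while debugging and in batches at scale.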

About .txt

dotevals is developed and maintained by .txt, a company dedicated to making LLMs more reliable for production applications.

Our focus is on advancing structured generation technology through:

🧪 Cutting-edge research: We publish our findings on structured generation.
🚀 Enterprise-grade solutions: You can license our enterprise-grade libraries.
🧩 Open source collaboration: We believe in building in public and contributing to the community.

Contributing

We welcome contributions! See our Contributing Guide for details.

License

MIT License - see LICENSE for details.
