Universal library for evaluating AI models

Autoevals

Autoevals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

LLM-as-a-judge
Heuristic (e.g. Levenshtein distance)
Statistical (e.g. BLEU)

Autoevals is developed by the team at Braintrust.

Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs, and manage exceptions.

Requirements

Python 3.9 or higher
Compatible with both OpenAI Python SDK v0.x and v1.x

Installation

TypeScript

npm install autoevals

Python

pip install autoevals

Getting started

Use Autoevals to model-grade an example LLM completion using the Factuality prompt. By default, Autoevals uses your OPENAI_API_KEY environment variable to authenticate with OpenAI's API.

Python

import asyncio

from autoevals.llm import Factuality

# Create a new LLM-based evaluator
evaluator = Factuality()

# Example question, model output, and reference answer
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

# Using the synchronous API
result = evaluator(output, expected, input=input)
print(f"Factuality score (sync): {result.score}")
print(f"Factuality metadata (sync): {result.metadata['rationale']}")

# Using the asynchronous API
async def main():
    result = await evaluator.eval_async(output, expected, input=input)
    print(f"Factuality score (async): {result.score}")
    print(f"Factuality metadata (async): {result.metadata['rationale']}")

# Run the async example
asyncio.run(main())

TypeScript

import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();

Using other AI providers

Autoevals reads the OPENAI_BASE_URL environment variable and uses it as the base URL for requests to an OpenAI-compatible API. If OPENAI_BASE_URL is not set, it defaults to the AI proxy.

If you choose to use the proxy, you'll also get:

Simplified access to many AI providers
Reduced costs with automatic request caching
Increased observability when you enable logging to Braintrust

The proxy is free to use, even if you don't have a Braintrust account.

If you have a Braintrust account, you can optionally set the BRAINTRUST_API_KEY environment variable instead of OPENAI_API_KEY to unlock additional features like logging and monitoring. You can also route requests to supported AI providers and models or custom models you have configured in Braintrust.
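The lookup order described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the function name `resolve_openai_settings`, the `ClientSettings` type, and the placeholder proxy URL are hypothetical, and the rule that OPENAI_API_KEY wins when both keys are set is an assumption inferred from the note in the examples that follow (which ask you to leave OPENAI_API_KEY unset).

```python
import os
from typing import Mapping, NamedTuple, Optional

# Placeholder only: the real proxy URL is not stated in this README.
DEFAULT_PROXY_URL = "https://proxy.example.invalid/v1"


class ClientSettings(NamedTuple):
    base_url: str
    api_key: Optional[str]


def resolve_openai_settings(env: Mapping[str, str] = os.environ) -> ClientSettings:
    """Pick a base URL and API key from environment-style variables."""
    # OPENAI_BASE_URL wins if set; otherwise fall back to the proxy.
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_PROXY_URL
    # Assumption: OPENAI_API_KEY takes precedence over BRAINTRUST_API_KEY.
    api_key = env.get("OPENAI_API_KEY") or env.get("BRAINTRUST_API_KEY")
    return ClientSettings(base_url=base_url, api_key=api_key)


# Only a Braintrust key is set, so the proxy URL and that key are selected.
settings = resolve_openai_settings({"BRAINTRUST_API_KEY": "bt-example-key"})
print(settings.base_url, settings.api_key)
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching your real environment.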

Python

# NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
from autoevals.llm import Factuality

# Create an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
evaluator = Factuality(model="claude-3-5-sonnet-latest")

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1] along with its raw outputs as metadata
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

TypeScript

// NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Run an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
  const result = await Factuality({
    model: "claude-3-5-sonnet-latest",
    output,
    expected,
    input,
  });

  // The evaluator returns a score in the range [0, 1] along with its raw outputs as metadata
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();

Custom client configuration

There are two ways to configure a custom client when you need to use a different OpenAI-compatible API:

Global configuration: Initialize a client that will be used by all evaluators
Instance configuration: Configure a client for a specific evaluator
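The relationship between the two options can be pictured with a small stand-alone sketch. It does not use Autoevals itself: `set_default_client` and `SketchEvaluator` are hypothetical stand-ins, and the rule that an instance-level client overrides the global one is an assumption made for illustration.

```python
from typing import Optional

# Module-level default, shared by every evaluator that has no client of its own.
_default_client: Optional[object] = None


def set_default_client(client: object) -> None:
    """Register a client globally (plays the role of a global init call)."""
    global _default_client
    _default_client = client


class SketchEvaluator:
    """Hypothetical evaluator that may carry its own client."""

    def __init__(self, client: Optional[object] = None) -> None:
        self._client = client

    def resolve_client(self) -> Optional[object]:
        # Assumption: an instance-level client overrides the global default.
        if self._client is not None:
            return self._client
        return _default_client


set_default_client("global-client")
print(SketchEvaluator().resolve_client())
print(SketchEvaluator(client="custom-client").resolve_client())
```

Global configuration suits the common case where every evaluator talks to the same endpoint; instance configuration is useful when a single evaluator needs a different one.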

Global configuration

Set up a client that all your evaluators will use:

Python

import openai
import asyncio
from autoevals import init
from autoevals.llm import Factuality

# Register one client for all evaluators created afterwards
init(openai.AsyncOpenAI(base_url="https://api.openai.com/v1/"))

async def main():
    evaluator = Factuality()
    result = await evaluator.eval_async(
        input="What is the speed of light in a vacuum?",
        output="The speed of light in a vacuum is 299,792,458 meters per second.",
        expected="The speed of light in a vacuum is approximately 300,000 kilometers per second."
    )
    print(f"Factuality score: {result.score}")

asyncio.run(main())

TypeScript

import OpenAI from "openai";
import { init, Factuality } from "autoevals";

const client = new OpenAI({
  baseURL: "https://api.openai.com/v1/",
});

init({ client });

(async () => {
  const result = await Factuality({
    input: "What is the speed of light in a vacuum?",
    output: "The speed of light in a vacuum is 299,792,458 meters per second.",
    expected:
      "The speed of light in a vacuum is approximately 300,000 kilometers per second (or precisely 299,792,458 meters per second).",
  });

  console.log("Factuality score:", result.score);
})();

Instance configuration

Configure a client for a specific evaluator instance:

Python

import openai
from autoevals.llm import Factuality

custom_client = openai.OpenAI(base_url="https://custom-api.example.com/v1/")
evaluator = Factuality(client=custom_client)

TypeScript

import OpenAI from "openai";
import { Factuality } from "autoevals";

(async () => {
  const customClient = new OpenAI({
    baseURL: "https://custom-api.example.com/v1/",
  });

  const result = await Factuality({
    client: customClient,
    output: "Paris is the capital of France",
    expected:
      "Paris is the capital of France and has a population of over 2 million",
    input: "Tell me about Paris",
  });
  console.log(result);
})();

Using Braintrust with Autoevals (optional)

Once you grade an output using Autoevals, you can use Braintrust to log and compare your evaluation results. This integration is optional; Autoevals works without it.

TypeScript

Create a file named example.eval.js (it must take the form *.eval.[ts|tsx|js|jsx]):

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Autoevals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});

Then, run

npx braintrust run example.eval.js

Python

Create a file named eval_example.py (it must take the form eval_*.py):

from braintrust import Eval
from autoevals.llm import Factuality

Eval(
    "Autoevals",
    data=lambda: [
        dict(
            input="Which country has the highest population?",
            expected="China",
        ),
    ],
    task=lambda *args: "People's Republic of China",
    scores=[Factuality],
)

Supported evaluation methods

LLM-as-a-judge evaluations

Battle
Closed QA
Humor
Factuality
Moderation
Security
Summarization
SQL
Translation
Fine-tuned binary classifiers

RAG evaluations

Context precision
Context relevancy
Context recall
Context entity recall
Faithfulness
Answer relevancy
Answer similarity
Answer correctness

Composite evaluations

Semantic list contains
JSON validity

Embedding evaluations

Embedding similarity

Heuristic evaluations

Levenshtein distance
Exact match
Numeric difference
JSON diff
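To make the idea of a heuristic evaluation concrete, here is a minimal pure-Python sketch of a Levenshtein-based score. It is not the library's implementation: the function names are hypothetical, and dividing by the length of the longer string to land in [0, 1] is one reasonable normalization chosen for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(output, expected) / longest


print(levenshtein_score("People's Republic of China", "China"))
```

Unlike an LLM-as-a-judge evaluator, a heuristic like this is deterministic and needs no API key, but it only measures surface similarity.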

Custom evaluation prompts

Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:

Python

from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template=prompt_prefix,
    choice_scores=output_scores,
    use_cot=True,
)

# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
expected = "Standardize Error Responses across APIs"

response = evaluator(output, expected, input=page_content)

print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")

TypeScript

import { LLMClassifierFromTemplate } from "autoevals";

(async () => {
  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}`;

  const choiceScores = { 1: 1, 2: 0 };

  const evaluator = LLMClassifierFromTemplate<{ input: string }>({
    name: "TitleQuality",
    promptTemplate,
    choiceScores,
    useCoT: true,
  });

  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;
  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
  const expected = `Standardize Error Responses across APIs`;

  const response = await evaluator({ input, output, expected });

  console.log("Score", response.score);
  console.log("Metadata", response.metadata);
})();
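The two ingredients of a custom evaluation prompt, a template with `{{input}}`, `{{output}}`, and `{{expected}}` placeholders and a mapping from the model's choice to a score, can be sketched without calling any model. This is a simplified illustration under stated assumptions: `render_template` and `score_choice` are hypothetical helpers, the regex substitution stands in for whatever template engine the library actually uses, and the model's reply is hard-coded.

```python
import re
from typing import Mapping, Optional


def render_template(template: str, variables: Mapping[str, str]) -> str:
    """Replace each {{name}} placeholder; unknown names are left untouched."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        return variables.get(name, match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def score_choice(choice: str, choice_scores: Mapping[str, float]) -> Optional[float]:
    """Look up the score for the model's choice; None if it is not a known choice."""
    return choice_scores.get(choice.strip())


template = "Issue Description: {{input}}\n\n1: {{output}}\n2: {{expected}}"
prompt = render_template(
    template,
    {
        "input": "Error responses differ between services.",
        "output": "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX",
        "expected": "Standardize Error Responses across APIs",
    },
)
print(prompt)

# Pretend the model answered "1" (the generated title is better).
print(score_choice("1", {"1": 1, "2": 0}))
```

Returning `None` for an unrecognized choice makes a malformed model reply visible instead of silently scoring it as 0.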

Creating custom scorers

You can also create your own scoring functions that do not use LLMs. For example, to test whether the word 'banana' is in the output, you can use the following:

Python

from autoevals import Score

def banana_scorer(output, expected, input):
    return Score(name="banana_scorer", score=1 if "banana" in output else 0)

input = "What is 1 banana + 2 bananas?"
output = "3"
expected = "3 bananas"

result = banana_scorer(output, expected, input)

print(f"Banana score: {result.score}")

TypeScript

import { Score } from "autoevals";

const bananaScorer = ({
  output,
  expected,
  input,
}: {
  output: string;
  expected: string;
  input: string;
}): Score => {
  return { name: "banana_scorer", score: output.includes("banana") ? 1 : 0 };
};

(async () => {
  const input = "What is 1 banana + 2 bananas?";
  const output = "3";
  const expected = "3 bananas";

  const result = bananaScorer({ output, expected, input });
  console.log(`Banana score: ${result.score}`);
})();
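Because a custom scorer is just a function of output, expected, and input, several scorers can be run side by side over the same example. The sketch below illustrates that pattern without importing Autoevals: `SketchScore` is a hypothetical stand-in for the library's `Score`, and `run_scorers` is a helper invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Sequence


@dataclass
class SketchScore:
    """Stand-in for a score object: a name, a value in [0, 1], optional metadata."""

    name: str
    score: float
    metadata: Dict[str, Any] = field(default_factory=dict)


Scorer = Callable[[str, str, str], SketchScore]


def banana_scorer(output: str, expected: str, input: str) -> SketchScore:
    found = "banana" in output
    return SketchScore("banana_scorer", 1.0 if found else 0.0, {"found": found})


def exact_match(output: str, expected: str, input: str) -> SketchScore:
    return SketchScore("exact_match", 1.0 if output == expected else 0.0)


def run_scorers(
    scorers: Sequence[Scorer], output: str, expected: str, input: str
) -> Dict[str, float]:
    """Apply every scorer to the same example and collect scores by name."""
    results = (scorer(output, expected, input) for scorer in scorers)
    return {result.name: result.score for result in results}


print(run_scorers([banana_scorer, exact_match], "3", "3 bananas", "What is 1 banana + 2 bananas?"))
```

Keeping every scorer behind the same call shape is what makes it cheap to add, remove, or compare evaluation methods.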

Why does this library exist?

There is nothing particularly novel about the evaluation methods in this library. They are all well-known and well-documented. However, there are a few things that are particularly difficult when evaluating in practice:

Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in number.py to see how it's done for numeric differences.
Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to debug one output at a time, propagate errors, and tweak the prompts. Autoevals makes these tasks easy.
Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to Autoevals, we couldn't find an open source library where you can simply pass in input, output, and expected values through a bunch of different evaluation methods.
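As an illustration of the first point, here is one way to squeeze a numeric difference into the range [0, 1]. It is a sketch for illustration only: the function name is hypothetical and the formula is an assumption that may not match number.py exactly.

```python
def numeric_diff_score(output: float, expected: float) -> float:
    """Relative-difference score in [0, 1]; 1.0 means the numbers are equal."""
    if output == 0 and expected == 0:
        return 1.0  # avoid 0/0; two zeros are a perfect match
    # |expected - output| never exceeds |expected| + |output|,
    # so the ratio stays within [0, 1].
    return 1.0 - abs(expected - output) / (abs(expected) + abs(output))


for pair in [(10, 10), (9, 10), (0, 10), (-3, 3)]:
    print(pair, numeric_diff_score(*pair))
```

The special case for two zeros shows why normalization is fiddly: the obvious formula divides by zero exactly where the answer is a perfect match.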

Documentation

The full docs are available for your reference.

Contributing

We welcome contributions!

To install the development dependencies, run make develop, and run source env.sh to activate the environment. Make a .env file from the .env.example file and set the environment variables. Run direnv allow to load the environment variables.

To run the tests, run pytest from the root directory.

Send a PR and we'll review it! We'll take care of versioning and releasing.

Release history

This version

0.0.129

May 13, 2025

0.0.127

Apr 9, 2025

0.0.126

Mar 25, 2025

0.0.125

Mar 25, 2025

0.0.124

Mar 19, 2025

0.0.123

Mar 10, 2025

0.0.122

Mar 7, 2025

0.0.121

Mar 1, 2025

0.0.120

Feb 25, 2025

0.0.119

Feb 1, 2025

0.0.118

Jan 24, 2025

0.0.117

Jan 16, 2025

0.0.116

Jan 15, 2025

0.0.115

Jan 11, 2025

0.0.114

Jan 10, 2025

0.0.113

Jan 7, 2025

0.0.112

Jan 3, 2025

0.0.111

Dec 16, 2024

0.0.110

Dec 13, 2024

0.0.109

Dec 10, 2024

0.0.108

Dec 2, 2024

0.0.107

Nov 25, 2024

0.0.106

Nov 21, 2024

0.0.105

Nov 19, 2024

0.0.104

Nov 15, 2024

0.0.103

Nov 12, 2024

0.0.101

Nov 4, 2024

0.0.100

Oct 27, 2024

0.0.99

Oct 18, 2024

0.0.98

Oct 16, 2024

0.0.97

Oct 15, 2024

0.0.96

Oct 11, 2024

0.0.95

Oct 9, 2024

0.0.94

Oct 1, 2024

0.0.93

Oct 1, 2024

0.0.92

Sep 26, 2024

0.0.91

Sep 24, 2024

0.0.90

Sep 19, 2024

0.0.89

Sep 4, 2024

0.0.88

Sep 4, 2024

0.0.87

Aug 30, 2024

0.0.86

Aug 26, 2024

0.0.85

Aug 7, 2024

0.0.84

Aug 7, 2024

0.0.83

Aug 2, 2024

0.0.82

Aug 1, 2024

0.0.81

Jul 23, 2024

0.0.80

Jul 19, 2024

0.0.79

Jul 17, 2024

0.0.78

Jul 17, 2024

0.0.77

Jul 17, 2024

0.0.76

Jul 9, 2024

0.0.75

Jul 5, 2024

0.0.74

Jul 1, 2024

0.0.73

Jun 29, 2024

0.0.72

Jun 25, 2024

0.0.71

Jun 18, 2024

0.0.70

Jun 13, 2024

0.0.69

Jun 12, 2024

0.0.68

May 28, 2024

0.0.67

May 24, 2024

0.0.66

May 24, 2024

0.0.65

May 18, 2024

0.0.64

Apr 26, 2024

0.0.63

Apr 25, 2024

0.0.62

Apr 24, 2024

0.0.61

Apr 17, 2024

0.0.60

Apr 16, 2024

0.0.59

Apr 16, 2024

0.0.58

Apr 16, 2024

0.0.57

Apr 16, 2024

0.0.56

Apr 9, 2024

0.0.55

Apr 2, 2024

0.0.54

Mar 28, 2024

0.0.53

Mar 17, 2024

0.0.52

Mar 17, 2024

0.0.51

Mar 14, 2024

0.0.50

Mar 7, 2024

0.0.49

Mar 5, 2024

0.0.48

Feb 25, 2024

0.0.47

Feb 23, 2024

0.0.46

Feb 4, 2024

0.0.45

Jan 23, 2024

0.0.44

Jan 11, 2024

0.0.43

Jan 11, 2024

0.0.42

Jan 10, 2024

0.0.41

Dec 31, 2023

0.0.40

Dec 18, 2023

0.0.39

Dec 18, 2023

0.0.38

Dec 16, 2023

0.0.37

Dec 16, 2023

0.0.36

Dec 15, 2023

0.0.35

Dec 15, 2023

0.0.34

Dec 6, 2023

0.0.33

Dec 5, 2023

0.0.32

Nov 28, 2023

0.0.31

Nov 10, 2023

0.0.30

Nov 9, 2023

0.0.29

Nov 8, 2023

0.0.28

Nov 3, 2023

0.0.27

Nov 3, 2023

0.0.26

Oct 17, 2023

0.0.25

Oct 13, 2023

0.0.24

Oct 12, 2023

0.0.23

Oct 11, 2023

0.0.22

Sep 20, 2023

0.0.21

Sep 13, 2023

0.0.20

Sep 13, 2023

0.0.19

Sep 13, 2023

0.0.18

Sep 6, 2023

0.0.17

Sep 6, 2023

0.0.16

Sep 5, 2023

0.0.15

Sep 5, 2023

0.0.14

Aug 24, 2023

0.0.13

Aug 18, 2023

0.0.12

Aug 18, 2023

0.0.11

Aug 17, 2023

0.0.10

Aug 17, 2023

0.0.9

Aug 16, 2023

0.0.8

Aug 5, 2023

0.0.7

Jul 26, 2023

0.0.6

Jul 26, 2023

0.0.5

Jul 24, 2023

0.0.4

Jul 12, 2023

0.0.3

Jul 11, 2023

0.0.2

Jul 11, 2023

0.0.1

Jul 10, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autoevals-0.0.129.tar.gz (50.0 kB)

Uploaded May 13, 2025 Source

Built Distribution

autoevals-0.0.129-py3-none-any.whl (53.5 kB)

Uploaded May 13, 2025 Python 3

File details

Details for the file autoevals-0.0.129.tar.gz.

File metadata

Download URL: autoevals-0.0.129.tar.gz
Upload date: May 13, 2025
Size: 50.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for autoevals-0.0.129.tar.gz
Algorithm	Hash digest
SHA256	`b7a6e45f8d4dd2bec0666602c78515b2f2c9f1a5c2a6b6275ad6cc3cac63e348`
MD5	`ceb425da9fcfb717e9a0ce490b876f51`
BLAKE2b-256	`31269a8d3b0e1ecbc22f8d7c1a44aa748660e846d6acb321eba4da620e08bf3c`

See more details on using hashes here.
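A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch: the helper names are hypothetical, and the local path assumes the archive was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Assumed local path; the digest is the SHA256 value listed for the source distribution.
ARCHIVE = "autoevals-0.0.129.tar.gz"
PUBLISHED = "b7a6e45f8d4dd2bec0666602c78515b2f2c9f1a5c2a6b6275ad6cc3cac63e348"

if Path(ARCHIVE).exists():
    print("hash matches:", matches_published_hash(ARCHIVE, PUBLISHED))
```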

File details

Details for the file autoevals-0.0.129-py3-none-any.whl.

File metadata

Download URL: autoevals-0.0.129-py3-none-any.whl
Upload date: May 13, 2025
Size: 53.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for autoevals-0.0.129-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7240e4e4bf1843bb5bc688b71fe2c6159596d3b5891bf34576941f17e04fe3ba`
MD5	`e7d11640201ad9a8c4bd62bd2e46f968`
BLAKE2b-256	`7b621a85254ab1e733270a61dcec18e01f102c11016520316e89122478e7d527`
