
AutoEvals

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

  • Heuristic (e.g. Levenshtein distance)
  • Statistical (e.g. BLEU)
  • Model-based (using LLMs)

AutoEvals is developed by the team at Braintrust.

AutoEvals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with AutoEvals. It's easy to add custom prompts, parse outputs, and manage exceptions.

Installation

AutoEvals is distributed as a Python library on PyPI and a Node.js library on npm.

Python

pip install autoevals

Node.js

npm install autoevals

Example

Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

Python

from autoevals.llm import Factuality

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score between 0 and 1 and includes the raw outputs from the grading model
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

Node.js

import { Factuality } from "autoevals";

(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata.rationale}`);
})();

Using Braintrust with AutoEvals

Once you grade an output using AutoEvals, it's convenient to use Braintrust to log and compare your evaluation results.

Python

from autoevals.llm import Factuality
import braintrust

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score between 0 and 1 and includes the raw outputs from the grading model
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

# Log the evaluation results to Braintrust
experiment = braintrust.init(
    project="AutoEvals", api_key="YOUR_BRAINTRUST_API_KEY"
)
experiment.log(
    inputs={"query": input},
    output=output,
    expected=expected,
    scores={
        "factuality": result.score,
    },
    metadata={
        "factuality": result.metadata,
    },
)
print(experiment.summarize())

Node.js

Create a file named example.eval.js (the filename must end with .eval.ts or .eval.js):

import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("AutoEvals", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
    },
  ],
  task: () => "People's Republic of China",
  scores: [Factuality],
});

Then, run:

npx braintrust run example.eval.js

Supported Evaluation Methods

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
  • Fine-tuned binary classifiers

Embeddings

  • Embedding similarity
  • BERTScore
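
For example, embedding similarity compares outputs in embedding space, so paraphrases that string metrics would penalize heavily can still score well. A minimal sketch, assuming the EmbeddingSimilarity scorer exported from autoevals.string (it computes embeddings via the OpenAI API, so an API key is required):

from autoevals.string import EmbeddingSimilarity

# Compare the output and expected strings by similarity of their
# embeddings rather than by surface-level edit distance.
evaluator = EmbeddingSimilarity()
result = evaluator(output="People's Republic of China", expected="China")
print(result.score)  # between 0 and 1; higher means semantically closer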

Heuristic

  • Levenshtein distance
  • Numeric difference
  • JSON diff
  • Jaccard distance
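
These scorers run locally with no model calls. A minimal sketch, assuming the Levenshtein and NumericDiff scorers exported from autoevals.string and autoevals.number:

from autoevals.number import NumericDiff
from autoevals.string import Levenshtein

# Levenshtein normalizes edit distance by string length, so identical
# strings score 1 and completely different strings approach 0.
print(Levenshtein()(output="China", expected="China").score)  # 1.0

# NumericDiff maps how close two numbers are onto a 0-1 scale.
print(NumericDiff()(output=104, expected=100).score)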

Statistical

  • BLEU
  • ROUGE
  • METEOR
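
As a point of reference (using the third-party sacrebleu package here, not AutoEvals' own API), sentence-level BLEU scores n-gram overlap between a candidate and one or more references:

import sacrebleu

# sacrebleu reports BLEU on a 0-100 scale; dividing by 100 matches
# AutoEvals' 0-1 convention.
candidate = "The cat sat on the mat."
references = ["There is a cat on the mat."]
bleu = sacrebleu.sentence_bleu(candidate, references)
print(bleu.score / 100)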

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:

Python

from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    prompt_prefix,
    output_scores,
    use_cot=False,
)

# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = (
    "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
)
expected = "Standardize Error Responses across APIs"

response = evaluator(output, expected, input=page_content)

print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")

Node.js

import { LLMClassifierFromTemplate } from "autoevals";

(async () => {
  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}`;

  const choiceScores = { 1: 1, 2: 0 };

  const evaluator = LLMClassifierFromTemplate({
    promptTemplate,
    choiceScores,
    useCoT: false,
  });

  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;
  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
  const expected = `Standardize Error Responses across APIs`;

  const response = await evaluator({ input, output, expected });

  console.log("Score", response.score);
  console.log("Metadata", response.metadata);
})();

Why does this library exist?

There is nothing novel about the evaluation methods in this library; they are all well known and well documented. However, a few things are genuinely difficult when evaluating in practice:

  • Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in number.py to see how it's done for numeric differences (a sketch follows this list).
  • Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to debug one output at a time, propagate errors, and tweak the prompts. AutoEvals makes these tasks easy.
  • Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to AutoEvals, we couldn't find an open source library where you could simply pass input, output, and expected values through a variety of evaluation methods.
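
Here is a minimal sketch of that normalization, close to (though not necessarily identical to) the calculation in number.py: the absolute difference is scaled by the magnitudes involved, so the score always lands between 0 and 1:

def numeric_diff_score(output: float, expected: float) -> float:
    # A sketch of one plausible normalization; see number.py in the
    # AutoEvals source for the actual calculation.
    if output == expected:
        return 1.0  # exact match (also avoids dividing by zero)
    return 1 - abs(output - expected) / (abs(output) + abs(expected))

print(numeric_diff_score(100, 100))  # 1.0
print(numeric_diff_score(90, 100))   # ~0.947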

Documentation

The full docs are available here.
