
Universal library for evaluating AI models


AutoEvals

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

  • Heuristic (e.g. Levenshtein distance)
  • Statistical (e.g. BLEU)
  • Model-based (using LLMs)

AutoEvals is developed by the team at BrainTrust.

AutoEvals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with AutoEvals. It's easy to add custom prompts, parse outputs, and manage exceptions.

Installation

AutoEvals is distributed as a Python library on PyPI and as a Node.js library on npm. To install the Python package:

pip install autoevals

Example

Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

from autoevals.llm import Factuality

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1] along with the evaluator's raw output
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

Using Braintrust with AutoEvals

Once you grade an output using AutoEvals, it's convenient to use BrainTrust to log and compare your evaluation results.

from autoevals.llm import Factuality
import braintrust

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1] along with the evaluator's raw output
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

# Log the evaluation results to BrainTrust
experiment = braintrust.init(
    project="AutoEvals", api_key="YOUR_BRAINTRUST_API_KEY"
)
experiment.log(
    inputs={"query": input},
    output=output,
    expected=expected,
    scores={
        "factuality": result.score,
    },
    metadata={
        "factuality": result.metadata,
    },
)
print(experiment.summarize())

Supported Evaluation Methods

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
  • Fine-tuned binary classifiers

Embeddings

  • BERTScore
  • Ada Embedding distance

Heuristic

  • Levenshtein distance
  • Jaccard distance
  • JSON diff
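
To make the heuristic category concrete, here is a minimal, standard-library sketch of two of the metrics listed above. It is not AutoEvals' own implementation: the function names (`levenshtein_distance`, `levenshtein_score`, `jaccard_score`) are hypothetical, and the choices to normalize by the longer string and to tokenize on whitespace are assumptions made for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means the strings are identical."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, expected) / longest


def jaccard_score(output: str, expected: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (distance is 1 - score)."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Both helpers return higher values for closer matches, which matches the [0, 1] score convention used in the examples above.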

Statistical

  • BLEU
  • ROUGE
  • METEOR
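
As a rough illustration of the statistical category, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on. It is only one component of BLEU: the brevity penalty and the geometric mean over several n-gram orders are omitted. The names `ngrams` and `modified_precision` are hypothetical, the whitespace tokenization is an assumption, and none of this is taken from AutoEvals' code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in the token list as a tuple."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count so repeating a word cannot inflate the score.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total
```

Clipping is what stops a degenerate candidate such as a single reference word repeated many times from scoring perfectly.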

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts for model-graded evaluation. To use them, pass in a prompt template and a mapping from each possible answer to a score:

from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    prompt_prefix,
    output_scores,
    use_cot=False,
)

# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = (
    "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
)
expected = "Standardize Error Responses across APIs"

response = evaluator(output, expected, input=page_content)

print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")
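
The two moving parts of a custom prompt, the `{{input}}`-style placeholders and the answer-to-score mapping, can be sketched without calling a model. This is a simplified illustration, not how `LLMClassifier` works internally: `render_template` and `score_choice` are hypothetical helpers, and the assumption that the model replies with exactly one of the choice keys is ours.

```python
import re


def render_template(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        # Raises KeyError if the template names a value that was not supplied.
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def score_choice(raw_answer: str, choice_scores: dict) -> float:
    """Look up the score for the model's answer, rejecting unexpected choices."""
    choice = raw_answer.strip()
    if choice not in choice_scores:
        raise ValueError(
            f"Unexpected choice {choice!r}; expected one of {sorted(choice_scores)}"
        )
    return choice_scores[choice]


# Hypothetical usage: fill the prompt, then score a pretend model reply of "1".
prompt = render_template(
    "Issue Description: {{input}}\n\n1: {{output}}\n2: {{expected}}",
    input="Error responses differ across APIs",
    output="Standardize error responses",
    expected="Fix errors",
)
score = score_choice(" 1\n", {"1": 1, "2": 0})
```

Raising on an unrecognized answer, rather than silently scoring it 0, makes prompt or parsing problems visible while you are debugging a new evaluator.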

Documentation

The full docs are available here.

