Universal library for evaluating AI models

AutoEvals

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

  • Heuristic (e.g. Levenshtein distance)
  • Statistical (e.g. BLEU)
  • Model-based (using LLMs)
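To make the heuristic category concrete, here is a minimal, self-contained sketch of a Levenshtein-based scorer. It does not use the AutoEvals API; the function names `levenshtein` and `levenshtein_score`, and the choice to normalize by the longer string's length, are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Hypothetical helper: map edit distance to a similarity in [0, 1]."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(output, expected) / longest


print(levenshtein_score("People's Republic of China", "China"))
```

A score of 1.0 means the strings are identical and 0.0 means every character differs, which is why a heuristic like this tends to penalize correct but longer answers that a model-based evaluator would accept.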

AutoEvals is developed by the team at BrainTrust.

AutoEvals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with AutoEvals. It's easy to add custom prompts, parse outputs, and manage exceptions.

Installation

AutoEvals is distributed as a Python library on PyPI and as a Node.js library on NPM.

pip install autoevals

Example

Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

from autoevals.llm import Factuality

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

Using Braintrust with AutoEvals

Once you grade an output using AutoEvals, it's convenient to use BrainTrust to log and compare your evaluation results.

from autoevals.llm import Factuality
import braintrust

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

# Log the evaluation results to BrainTrust
experiment = braintrust.init(
    project="AutoEvals", api_key="YOUR_BRAINTRUST_API_KEY"
)
experiment.log(
    inputs={"query": input},
    output=output,
    expected=expected,
    scores={
        "factuality": result.score,
    },
    metadata={
        "factuality": result.metadata,
    },
)
print(experiment.summarize())

Supported Evaluation Methods

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
  • Fine-tuned binary classifiers

Embeddings

  • BERTScore
  • Ada Embedding distance
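As a rough illustration of how an embedding-distance evaluator can turn two vectors into a score, the sketch below computes cosine similarity in plain Python. It is not the AutoEvals implementation: the embedding vectors are assumed to come from some external model, and the names `cosine_similarity` and `embedding_score`, as well as the rescaling from [-1, 1] to [0, 1], are hypothetical choices.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_u * norm_v)


def embedding_score(u: Sequence[float], v: Sequence[float]) -> float:
    """Hypothetical helper: rescale cosine similarity from [-1, 1] to [0, 1]."""
    return (cosine_similarity(u, v) + 1) / 2


# Toy vectors standing in for real embeddings of `output` and `expected`
print(embedding_score([0.1, 0.9, 0.2], [0.2, 0.8, 0.1]))
```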

Heuristic

  • Levenshtein distance
  • Jaccard distance
  • JSON diff
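For reference, Jaccard distance compares two sets by the ratio of shared to total elements. The sketch below is a self-contained illustration over lowercased, whitespace-separated tokens; the tokenization and the function names are assumptions, not the AutoEvals implementation.

```python
def jaccard_similarity(output: str, expected: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    a = set(output.lower().split())
    b = set(expected.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(output: str, expected: str) -> float:
    """Jaccard distance is the complement of Jaccard similarity."""
    return 1.0 - jaccard_similarity(output, expected)


print(jaccard_distance("the cat sat", "the cat ran"))
```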

Statistical

  • BLEU
  • ROUGE
  • METEOR

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:

from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    prompt_prefix,
    output_scores,
    use_cot=False,
)

# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = (
    "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
)
expected = "Standardize Error Responses across APIs"

response = evaluator(output, expected, input=page_content)

print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")
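The two moving parts of a custom evaluator like the one above are filling the `{{input}}`, `{{output}}`, and `{{expected}}` placeholders and mapping the model's chosen answer to a number. The following is a simplified sketch of those two steps under stated assumptions: `render_template` and `score_choice` are hypothetical helpers, not AutoEvals functions, and the library's real template handling and output parsing may differ.

```python
import re


def render_template(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing template value: {key}")
        return values[key]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def score_choice(choice: str, choice_scores: dict) -> float:
    """Map the model's answer (e.g. "1" or "2") to its configured score."""
    choice = choice.strip()
    if choice not in choice_scores:
        raise ValueError(f"unexpected choice: {choice!r}")
    return choice_scores[choice]


prompt = render_template(
    "Issue Description: {{input}}\n\n1: {{output}}\n2: {{expected}}",
    input="Error responses differ between services.",
    output="Standardize error responses",
    expected="Fix errors",
)
print(prompt)
print(score_choice(" 1\n", {"1": 1, "2": 0}))
```

Raising on an unexpected choice, rather than silently returning 0, keeps malformed model answers visible when debugging a prompt.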

Documentation

The full docs are available here.

