Universal library for evaluating AI models

AutoEvals

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

It bundles together a variety of automatic evaluation methods including:

  • Heuristic (e.g. Levenshtein distance)
  • Statistical (e.g. BLEU)
  • Model-based (using LLMs)
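To make the heuristic category concrete, here is a minimal pure-Python sketch of a Levenshtein-based score. It is an illustration only, not AutoEvals' implementation: the function names `levenshtein_distance` and `levenshtein_score` are hypothetical, and normalizing by the longer string's length is an assumption about how a distance could be mapped to a score in [0, 1].

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def levenshtein_score(output: str, expected: str) -> float:
    """Map the edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # Two empty strings are treated as a perfect match.
    return 1.0 - levenshtein_distance(output, expected) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_score("China", "China"))  # 1.0
```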

AutoEvals is developed by the team at BrainTrust.

AutoEvals uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also create your own model-graded evaluations with AutoEvals. It's easy to add custom prompts, parse outputs, and manage exceptions.

Installation

AutoEvals is distributed as a Python library on PyPI and as a Node.js library on npm.

pip install autoevals

npm install autoevals

Example

Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

from autoevals.llm import Factuality

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1], plus metadata with its raw outputs
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")
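In practice you will usually score more than one example. The sketch below averages scores over a small list of cases, using the same `evaluator(output, expected, input=...)` calling convention as the example above. To keep it runnable without an API key, it uses `exact_match`, a hypothetical stand-in evaluator that is not part of AutoEvals; `average_score` is likewise an illustrative helper.

```python
from statistics import mean
from types import SimpleNamespace


def exact_match(output, expected, input=None):
    """Hypothetical stand-in evaluator: 1.0 on a case-insensitive match, else 0.0."""
    matched = output.strip().lower() == expected.strip().lower()
    return SimpleNamespace(score=1.0 if matched else 0.0, metadata={})


def average_score(evaluator, cases):
    """Run the evaluator over (input, output, expected) tuples and average the scores."""
    results = [
        evaluator(output, expected, input=question)
        for question, output, expected in cases
    ]
    return mean(result.score for result in results)


cases = [
    ("Which country has the highest population?", "China", "China"),
    ("What is the capital of France?", "Lyon", "Paris"),
]
print(average_score(exact_match, cases))  # 0.5
```

Any callable that follows the same convention and returns an object with a `score` attribute could be passed in place of `exact_match`.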

Using BrainTrust with AutoEvals

Once you grade an output using AutoEvals, it's convenient to use BrainTrust to log and compare your evaluation results.

from autoevals.llm import Factuality
import braintrust

# Create a new LLM-based evaluator
evaluator = Factuality()

# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"

result = evaluator(output, expected, input=input)

# The evaluator returns a score in the range [0, 1], plus metadata with its raw outputs
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")

# Log the evaluation results to BrainTrust
experiment = braintrust.init(
    project="AutoEvals", api_key="YOUR_BRAINTRUST_API_KEY"
)
experiment.log(
    inputs={"query": input},
    output=output,
    expected=expected,
    scores={
        "factuality": result.score,
    },
    metadata={
        "factuality": result.metadata,
    },
)
print(experiment.summarize())

Supported Evaluation Methods

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
  • Fine-tuned binary classifiers

Embeddings

  • BERTScore
  • Ada Embedding distance

Heuristic

  • Levenshtein distance
  • Jaccard distance
  • JSON diff
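As a rough illustration of one of the heuristic methods listed above, Jaccard distance can be computed over word sets in a few lines. This is a sketch under stated assumptions, not the library's code: the function name is hypothetical, and splitting on whitespace after lowercasing is an assumed tokenization.

```python
def jaccard_distance(output: str, expected: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercase, whitespace-separated tokens."""
    tokens_a = set(output.lower().split())
    tokens_b = set(expected.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # Two empty strings are treated as identical.
    return 1.0 - len(tokens_a & tokens_b) / len(union)


print(jaccard_distance("a b c", "b c d"))  # 0.5
```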

Statistical

  • BLEU
  • ROUGE
  • METEOR

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts for model-graded evaluation. To use one, pass in a prompt and a mapping from each possible answer to a score:

from autoevals import LLMClassifier

# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""

# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}

evaluator = LLMClassifier(
    prompt_prefix,
    output_scores,
    use_cot=False,
)

# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = (
    "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
)
expected = "Standardize Error Responses across APIs"

response = evaluator(output, expected, input=page_content)

print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")
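The two ingredients in the example above, a template with `{{input}}`, `{{output}}`, and `{{expected}}` placeholders and a dictionary that maps each answer to a score, can be pictured with the following sketch. It only illustrates the idea; how AutoEvals actually renders prompts and parses the model's answer may differ, and `render_prompt` and `score_choice` are hypothetical helpers.

```python
import re


def render_prompt(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword value."""
    def substitute(match):
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def score_choice(choice: str, choice_scores: dict) -> float:
    """Look up the score for the answer the grading model picked."""
    choice = choice.strip()
    if choice not in choice_scores:
        raise ValueError(f"Unexpected choice: {choice!r}")
    return choice_scores[choice]


template = "Issue Description: {{input}}\n\n1: {{output}}\n2: {{expected}}"
prompt = render_prompt(
    template,
    input="Error responses differ across APIs",
    output="Title A",
    expected="Title B",
)
print(prompt)

# Pretend the grading model answered "1", i.e. it preferred the generated title.
print(score_choice(" 1 ", {"1": 1, "2": 0}))  # 1
```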

Documentation

The full docs are available here.

This version

0.0.8
