Universal library for evaluating AI models

Project description

AutoEvals

AutoEvals is a tool for quickly and easily evaluating AI model outputs. It comes with a variety of evaluation methods, including heuristic (e.g. Levenshtein distance), statistical (e.g. BLEU), and model-based (using LLMs).

Many of the model-based evaluations are adapted from OpenAI's excellent evals project, but are implemented so that you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also add your own custom prompts, and AutoEvals will take care of adding Chain-of-Thought reasoning, parsing outputs, and managing exceptions.

Installation

To install AutoEvals, run the following command:

pip install autoevals

Example

from autoevals.llm import Fact

# Fact checks whether the output states the same fact as the expected answer.
evaluator = Fact()
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score)     # numeric score between 0 and 1
print(result.metadata)  # the model's rationale and other debugging info

Supported Evaluation Methods

Heuristic

  • Levenshtein distance

  • Jaccard distance

  • BLEU
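For intuition, here is a minimal sketch of the idea behind one of these scorers: a normalized Levenshtein similarity, computed in pure Python (an illustration of the technique, not the library's implementation):

# Classic dynamic-programming edit distance between two strings.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# Turn the distance into a similarity score: 1.0 = identical, 0.0 = disjoint.
def levenshtein_score(output: str, expected: str) -> float:
    max_len = max(len(output), len(expected)) or 1
    return 1.0 - levenshtein(output, expected) / max_len

print(levenshtein_score("People's Republic of China", "China"))  # ~0.19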

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
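Each of these classifiers is used the same way as Fact in the example above: instantiate the class, then call it with the relevant fields. As a sketch, here is Battle, which asks the model which of two responses better follows a set of instructions (the keyword arguments below are assumptions modeled on the Fact example; check each class for its exact signature):

from autoevals.llm import Battle

# Battle asks the model to pick the better of two responses to the same
# instructions. Keyword arguments are assumed to mirror the Fact example.
evaluator = Battle()
result = evaluator(
    instructions="Add the following numbers: 1, 2, 3",
    output="6",
    expected="600",
)
print(result.score)  # e.g. 1 if `output` is judged better than `expected`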

Other Model-Based

  • Embedding distance
  • Fine-tuned classifiers
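Embedding distance compares the output and expected strings in vector space instead of asking an LLM to judge. A minimal sketch of the underlying idea (pure illustration, not the library's implementation; embed() stands in for any embedding-model call):

import math

# Cosine similarity between two embedding vectors: closer to 1.0 means the
# two texts point in a similar semantic direction.
def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# score = cosine_similarity(embed(output), embed(expected))
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 1.0]))  # ~0.816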

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts. To use them, simply pass in a prompt and scoring mechanism:

from autoevals import LLMClassifier

# The classifier renders the {{...}} template variables, asks the model to
# answer "1" or "2", and maps that choice to a score via the dict below.
evaluator = LLMClassifier(
    """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{page_content}}

1: {{output}}
2: {{expected}}

Please discuss each title briefly (one line for pros, one for cons).
""",
    {"1": 1, "2": 0},  # choosing "1" (the output title) scores 1; "2" scores 0
    use_cot=False,
)

page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,

We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?

Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""

gen_title = "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
original_title = "Standardize Error Responses across APIs"


response = evaluator(gen_title, original_title, page_content=page_content)

print(f"Score: {response.score}")  # 1 means the model preferred the generated title


Download files

Download the file for your platform.

Source Distribution

autoevals-0.0.2.tar.gz (7.4 kB)

Uploaded Source

Built Distribution

autoevals-0.0.2-py3-none-any.whl (6.6 kB)

Uploaded Python 3

File details

Details for the file autoevals-0.0.2.tar.gz.

File metadata

  • Download URL: autoevals-0.0.2.tar.gz
  • Size: 7.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.5

File hashes

Hashes for autoevals-0.0.2.tar.gz:

  • SHA256: f56e2132798fa2b4c2848f8c1ea255ac9faec27d7a61a8e9f167bbdf195b7c0c
  • MD5: d24849426976d583a50b0734588e101e
  • BLAKE2b-256: b59e90b2882c75853e60a7d695dde8d1466be8a8faf4892eaa54753bf652b06e

File details

Details for the file autoevals-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: autoevals-0.0.2-py3-none-any.whl
  • Size: 6.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.5

File hashes

Hashes for autoevals-0.0.2-py3-none-any.whl:

  • SHA256: 84f8e89971a6f999c6da958cf459c3f9884e1a9f47c11d3431a6e93aed78f782
  • MD5: db92c39f2e40c7bb19ef891177169384
  • BLAKE2b-256: 3073a2552132639c21d42a2bac8c438bd4651f305765f3c3152374a5ad544079
