AutoEvals
Universal library for evaluating AI models
AutoEvals is a tool for quickly and easily evaluating AI model outputs. It comes with a variety of evaluation methods, including heuristic (e.g. Levenshtein distance), statistical (e.g. BLEU), and model-based (using LLMs).
Many of the model-based evaluations are adapted from OpenAI's excellent evals project, but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.
You can also add your own custom prompts and let AutoEvals handle adding Chain-of-Thought, parsing outputs, and managing exceptions.
Installation
To install AutoEvals, run the following command:
pip install autoevals
Example
from autoevals.llm import *
evaluator = Fact()
result = evaluator(
output="People's Republic of China", expected="China",
input="Which country has the highest population?"
)
print(result.score)
print(result.metadata)
Supported Evaluation Methods
Heuristic
- Levenshtein distance
- Jaccard distance
- BLEU
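The heuristic scorers run locally, without any LLM calls, and follow the same call pattern as the example above. Here is a minimal sketch (the Levenshtein class name and import path are assumptions based on that pattern, not confirmed from this release):

from autoevals import Levenshtein  # class name and import path are assumptions

evaluator = Levenshtein()
result = evaluator(output="People's Republic of China", expected="China")
print(result.score)  # assumed convention: scores closer to 1 indicate closer string matches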
Model-Based Classification
- Battle
- ClosedQA
- Humor
- Factuality
- Security
- Summarization
- SQL
- Translation
Other Model-Based
- Embedding distance
- Fine-tuned classifiers
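Embedding distance compares the output and expected strings by the similarity of their vector embeddings rather than by asking an LLM to classify them. A minimal sketch (the EmbeddingSimilarity class name and import path are assumptions; consult the package for the actual interface):

from autoevals import EmbeddingSimilarity  # class name and import path are assumptions

evaluator = EmbeddingSimilarity()
result = evaluator(output="People's Republic of China", expected="China")
print(result.score)  # assumed convention: higher scores mean more semantically similar strings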
Custom Evaluation Prompts
AutoEvals supports custom evaluation prompts. To use one, pass in a prompt and a scoring mechanism:
from autoevals import LLMClassifier

evaluator = LLMClassifier(
    """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{page_content}}

1: {{output}}
2: {{expected}}

Please discuss each title briefly (one line for pros, one for cons).
""",
    {"1": 1, "2": 0},  # choice "1" (the output title) scores 1, choice "2" (the expected title) scores 0
    use_cot=False,
)
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
gen_title = "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
original_title = "Standardize Error Responses across APIs"
response = evaluator(gen_title, original_title, page_content=page_content)
print(f"Score: {response.score}")