An API for using metric models (either provided by default or fine-tuned yourself) to evaluate LLMs.
Project description
A library for evaluating LLMs with evaluation models: either the default models provided by LastMile or models you have fine-tuned yourself.
Evaluations are run on dataframes that include any combination of input, ground_truth, and output columns. At least one of these columns must be defined, and all values must be strings.
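The column rules above can be checked before sending anything to the service. The following is a minimal sketch, not part of lastmile_auto_eval: the helper name validate_eval_columns and its error messages are hypothetical, and it works on a plain column-to-values mapping so that it runs without pandas.

```python
from typing import Mapping, Sequence

# Column names taken from the description above.
ALLOWED_COLUMNS = ("input", "ground_truth", "output")


def validate_eval_columns(columns: Mapping[str, Sequence[object]]) -> list[str]:
    """Return the evaluation columns that are present, or raise ValueError.

    Hypothetical helper for illustration; not part of lastmile_auto_eval.
    """
    present = [name for name in ALLOWED_COLUMNS if name in columns]
    if not present:
        raise ValueError(f"expected at least one of {ALLOWED_COLUMNS}")
    for name in present:
        for row, value in enumerate(columns[name]):
            if not isinstance(value, str):
                raise ValueError(
                    f"column {name!r}, row {row}: expected str, "
                    f"got {type(value).__name__}"
                )
    return present


columns = {
    "input": ["what color is the sky?"],
    "output": ["the sky is blue"],
}
print(validate_eval_columns(columns))  # ['input', 'output']
```

With a pandas dataframe, df.to_dict("list") produces a mapping of this shape.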
Synchronous Requests
You can use evaluate() for non-streaming results and stream_evaluate() for streaming results.
Asynchronous Requests
You can use submit_job(), which returns a job_id string. This is ideal for evaluations that don't require immediate responses. With a job_id (e.g. cm0c4bxwo002kpe01h5j4zj2y) you can:
- Read job info: <host_url>/api/auto_eval_job/read?id=<job_id>
  Ex: https://eval.lastmileai.dev/api/auto_eval_job/read?id=cm0c4bxwo002kpe01h5j4zj2y
- Retrieve results: <host_url>/api/auto_eval_job/get_results?id=<job_id>
  Ex: https://lastmileai.dev/api/auto_eval_job/get_results?id=cm0c4bxwo002kpe01h5j4zj2y
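The two URL patterns above can be assembled from a host and a job_id. This is a small sketch using only the standard library; the helper name job_urls is hypothetical and not part of lastmile_auto_eval, and the host is left as a parameter because the right value depends on your deployment. No request is made, and any authentication the endpoints may require is not covered here.

```python
from urllib.parse import urlencode


def job_urls(host_url: str, job_id: str) -> dict[str, str]:
    """Build the job info and results URLs for a submitted job.

    Hypothetical helper for illustration; it only formats the URL
    patterns described above.
    """
    base = host_url.rstrip("/") + "/api/auto_eval_job"
    query = urlencode({"id": job_id})
    return {
        "read": f"{base}/read?{query}",
        "get_results": f"{base}/get_results?{query}",
    }


urls = job_urls("https://eval.lastmileai.dev", "cm0c4bxwo002kpe01h5j4zj2y")
print(urls["read"])
print(urls["get_results"])
```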
Example Usage
from lastmile_auto_eval import (
    EvaluationMetric,
    EvaluationResult,
    evaluate,
    stream_evaluate,
    submit_job,
)
import pandas as pd
import json
from typing import Any, Generator
queries = ["what color is the sky?", "what color is the sky?"]
correct_answer = "the sky is blue"
incorrect_answer = "the sky is red"
ground_truth_values = [correct_answer, correct_answer]
responses = [correct_answer, incorrect_answer]
df = pd.DataFrame(
    {
        "input": queries,
        "ground_truth": ground_truth_values,
        "output": responses,
    }
)
# Non-streaming
result: EvaluationResult = evaluate(
    dataframe=df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(json.dumps(result, indent=2))
# Response will look something like this:
"""
{
  "p_faithful": [
    0.999255359172821,
    0.00011296303273411468
  ],
  "summarization": [
    0.9995583891868591,
    6.86283819959499e-05
  ]
}
"""
# Response-streaming
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))
# Bidirectional-streaming
def gen_df_stream(input: list[str], gt: list[str], output: list[str]):
    # Yield one single-row dataframe per example.
    for i in range(len(input)):
        df_chunk = pd.DataFrame(
            {
                "input": [input[i]],
                "ground_truth": [gt[i]],
                "output": [output[i]],
            }
        )
        yield df_chunk


df_iterator = gen_df_stream(
    input=queries, gt=ground_truth_values, output=responses
)
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df_iterator,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))
# Async request
job_id = submit_job(
    df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(f"{job_id=}")
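The streaming loops in the example above print each chunk as it arrives. If you also want a single combined result at the end, the chunks can be merged. The sketch below assumes each streamed chunk has the same shape as the non-streaming response shown above (a mapping from metric name to a list of scores); that shape is an assumption, not something the library documents here. The helper name merge_result_chunks is hypothetical, and the stand-in chunks reuse the scores from the sample response rather than real streamed output.

```python
from collections import defaultdict
from typing import Iterable, Mapping, Sequence


def merge_result_chunks(
    chunks: Iterable[Mapping[str, Sequence[float]]],
) -> dict[str, list[float]]:
    """Concatenate per-metric score lists across chunks, in arrival order.

    Hypothetical helper; assumes each chunk maps metric name -> scores.
    """
    merged: defaultdict[str, list[float]] = defaultdict(list)
    for chunk in chunks:
        for metric, scores in chunk.items():
            merged[metric].extend(scores)
    return dict(merged)


# Stand-in chunks for illustration, one per dataframe row.
chunks = [
    {"p_faithful": [0.999255359172821], "summarization": [0.9995583891868591]},
    {"p_faithful": [0.00011296303273411468], "summarization": [6.86283819959499e-05]},
]
print(merge_result_chunks(chunks))
```

In real use, the iterator returned by stream_evaluate() would take the place of the chunks list, provided the shape assumption holds.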
File details
Details for the file lastmile_auto_eval-0.0.5.tar.gz.
File metadata
- Download URL: lastmile_auto_eval-0.0.5.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 53edd642aa48ef942cb4ae9abfdf5816669c922e286f8c93ad10b8384b9e14f7
MD5 | 99817aecc8c09c557310de4a9fb9dcf8
BLAKE2b-256 | 54e844356cc369ee08c4003167b71f98bb6f44c41376467db361fbf6e38b655b
File details
Details for the file lastmile_auto_eval-0.0.5-py3-none-any.whl.
File metadata
- Download URL: lastmile_auto_eval-0.0.5-py3-none-any.whl
- Upload date:
- Size: 11.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | b30d5c1e9caf616f84049e19560863e23d059ef290834cd8cc8b16adc5b53a5f
MD5 | 783a6e1ca00037938c6ac098b6be2017
BLAKE2b-256 | e44a7fcbfb06887b398d842ab5a79bb2278e234b9be86f504eb857e0d5294864