An API for using metric models (either provided by default or fine-tuned yourself) to evaluate LLMs.
Project description
A library for evaluating LLMs with evaluation models: either the default models provided by LastMile or models you have fine-tuned yourself.
Evaluations are run on dataframes that include any combination of input, ground_truth, and output columns. At least one of these columns must be defined, and all values must be strings.
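The column rules above can be checked before sending anything to the service. The following is a minimal sketch, not part of lastmile_auto_eval: `validate_eval_columns` and `ALLOWED_COLUMNS` are hypothetical names, and the helper works on a plain mapping of column name to values so that it has no dependencies. Whether the library rejects extra columns is not stated here, so the sketch ignores them.

```python
from typing import Mapping, Sequence

# Column names taken from the description above.
ALLOWED_COLUMNS = ("input", "ground_truth", "output")


def validate_eval_columns(columns: Mapping[str, Sequence[object]]) -> list[str]:
    """Return a list of problems; an empty list means the data looks valid.

    Hypothetical helper for illustration, not a lastmile_auto_eval API.
    """
    problems: list[str] = []
    present = [name for name in ALLOWED_COLUMNS if name in columns]
    if not present:
        problems.append(
            "at least one of 'input', 'ground_truth', 'output' must be defined"
        )
    # Every value in a recognized column must be a string.
    for name in present:
        for row, value in enumerate(columns[name]):
            if not isinstance(value, str):
                problems.append(
                    f"column {name!r}, row {row}: "
                    f"expected str, got {type(value).__name__}"
                )
    return problems


print(validate_eval_columns({"input": ["what color is the sky?"]}))
print(validate_eval_columns({"input": ["ok", 3]}))
```

For a pandas DataFrame, the same check can be run on `df.to_dict(orient="list")`.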
Synchronous Requests
Use evaluate() for non-streaming results and stream_evaluate() for streaming results.
Asynchronous Requests
Use submit_job(), which returns a job_id string. This is ideal for evaluations that don't require an immediate response. With a job_id (e.g. cm0c4bxwo002kpe01h5j4zj2y) you can:
- Read job info: <host_url>/api/auto_eval_job/read?id=<job_id>
  Ex: https://lastmileai.dev/api/auto_eval_job/read?id=cm0c4bxwo002kpe01h5j4zj2y
- Retrieve results: <host_url>/api/auto_eval_job/get_results?id=<job_id>
  Ex: https://lastmileai.dev/api/auto_eval_job/get_results?id=cm0c4bxwo002kpe01h5j4zj2y
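The two URL patterns above can be assembled from a host and a job_id with the standard library. This is only a sketch: `job_urls` is a hypothetical helper, not something the package is known to export, and it builds the URLs without fetching them, since any authentication the endpoints may require is not described here.

```python
from urllib.parse import urlencode

# Paths copied from the URL patterns listed above.
READ_PATH = "/api/auto_eval_job/read"
RESULTS_PATH = "/api/auto_eval_job/get_results"


def job_urls(host_url: str, job_id: str) -> dict[str, str]:
    """Build the job-info and results URLs for a job (hypothetical helper)."""
    base = host_url.rstrip("/")  # tolerate a trailing slash on the host
    query = urlencode({"id": job_id})  # escapes the id if it ever needs it
    return {
        "read": f"{base}{READ_PATH}?{query}",
        "results": f"{base}{RESULTS_PATH}?{query}",
    }


urls = job_urls("https://lastmileai.dev", "cm0c4bxwo002kpe01h5j4zj2y")
print(urls["read"])
print(urls["results"])
```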
Example Usage
from lastmile_auto_eval import (
    EvaluationMetric,
    EvaluationResult,
    evaluate,
    stream_evaluate,
    submit_job,
)
import pandas as pd
import json
from typing import Any, Generator

# Two rows with the same question: one correct response, one incorrect.
queries = ["what color is the sky?", "what color is the sky?"]
correct_answer = "the sky is blue"
incorrect_answer = "the sky is red"
ground_truth_values = [correct_answer, correct_answer]
responses = [correct_answer, incorrect_answer]
df = pd.DataFrame(
    {
        "input": queries,
        "ground_truth": ground_truth_values,
        "output": responses,
    }
)

# Non-streaming
result: EvaluationResult = evaluate(
    dataframe=df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(json.dumps(result, indent=2))

# Response will look something like this:
"""
{
  "p_faithful": [
    0.999255359172821,
    0.00011296303273411468
  ],
  "summarization": [
    0.9995583891868591,
    6.86283819959499e-05
  ]
}
"""

# Response-streaming: results arrive in chunks as they become available
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))

# Bidirectional-streaming: the input is streamed as one-row dataframes
def gen_df_stream(input: list[str], gt: list[str], output: list[str]):
    for i in range(len(input)):
        df_chunk = pd.DataFrame(
            {
                "input": [input[i]],
                "ground_truth": [gt[i]],
                "output": [output[i]],
            }
        )
        yield df_chunk


df_iterator = gen_df_stream(
    input=queries, gt=ground_truth_values, output=responses
)
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df_iterator,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))

# Async request: returns immediately with a job_id to look up later
job_id = submit_job(
    df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(f"{job_id=}")
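
When streaming, it can be convenient to collect the chunks back into one result. The sketch below assumes each streamed chunk has the same shape as the non-streaming response shown above (a mapping from metric name to a list of scores); that shape is inferred from the sample output, not confirmed for streamed chunks. `merge_result_chunks` is a hypothetical helper, and the scores in the demo are made-up illustrative values.

```python
from typing import Iterable, Mapping, Sequence


def merge_result_chunks(
    chunks: Iterable[Mapping[str, Sequence[float]]],
) -> dict[str, list[float]]:
    """Concatenate per-metric score lists across chunks, preserving order.

    Hypothetical helper; assumes each chunk maps metric name -> list of scores.
    """
    merged: dict[str, list[float]] = {}
    for chunk in chunks:
        for metric, scores in chunk.items():
            merged.setdefault(metric, []).extend(scores)
    return merged


# Illustrative chunks only; real values would come from stream_evaluate().
demo_chunks = [
    {"p_faithful": [0.9], "summarization": [0.8]},
    {"p_faithful": [0.1], "summarization": [0.2]},
]
print(merge_result_chunks(demo_chunks))
```

Under the stated assumption, `merge_result_chunks(result_iterator)` would consume the stream and return a single dictionary.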
Hashes for lastmile_auto_eval-0.0.10.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 47ac3d28647bb8f57d0dfc1c1a5c6a1601a9d14666bab38040ade7a3278ccacd
MD5 | f7dfd0ad8dfe1bb20d3be35883b0d07c
BLAKE2b-256 | 5b01a72161fe61040057703d658b7d3e957b0383430a9c12286b9b082c13a678

Hashes for lastmile_auto_eval-0.0.10-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 027740c9670323e69092a7d674c17b1ddaf96dbf9df6687a489bcc3b11ef5122
MD5 | a725c0a189a3b398d721276cf90c539f
BLAKE2b-256 | 5a804ed310973c7c188cdd9be8eb073d5adf07250c6d020c38172fdb508c923a