An API for using metric models (either provided by default or fine-tuned yourself) to evaluate LLMs.

Project description

A library for evaluating LLMs with evaluation models: either the default models provided by LastMile or models you have fine-tuned yourself.

Evaluations run on dataframes that include any combination of input, ground_truth, and output columns. At least one of these columns must be present, and all values must be strings.
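
The column rules above can be checked before a request is sent. The following is a minimal sketch, not part of lastmile_auto_eval: validate_columns is a hypothetical helper, and whether the library accepts extra columns is not stated here, so the sketch ignores them. It is written against a plain dict of lists; a pandas DataFrame should also work, since both support `in` and `[]` on column names.

```python
from typing import Mapping, Sequence

# Column names taken from the description above.
ALLOWED_COLUMNS = ("input", "ground_truth", "output")


def validate_columns(columns: Mapping[str, Sequence[object]]) -> None:
    """Hypothetical pre-check; not part of lastmile_auto_eval."""
    present = [name for name in ALLOWED_COLUMNS if name in columns]
    if not present:
        raise ValueError(f"expected at least one of {ALLOWED_COLUMNS}")
    for name in present:
        for row, value in enumerate(columns[name]):
            # Every value in a recognized column must be a string.
            if not isinstance(value, str):
                raise TypeError(
                    f"column {name!r}, row {row}: expected str, "
                    f"got {type(value).__name__}"
                )


# Passes silently: two recognized columns, all values are strings.
validate_columns(
    {"input": ["what color is the sky?"], "output": ["the sky is blue"]}
)
```

Raising early keeps a malformed row from surfacing later as a less specific error.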

Synchronous Requests

Use evaluate() for non-streaming results and stream_evaluate() for streaming results.

Asynchronous Requests

Use submit_job() to run an evaluation asynchronously; it returns a job_id string. This is ideal for evaluations that don't require an immediate response. With a job_id (e.g. cm0c4bxwo002kpe01h5j4zj2y) you can:

  1. Read job info: <host_url>/api/auto_eval_job/read?id=<job_id>

    Ex: https://lastmileai.dev/api/auto_eval_job/read?id=cm0c4bxwo002kpe01h5j4zj2y

  2. Retrieve results: <host_url>/api/auto_eval_job/get_results?id=<job_id>

    Ex: https://lastmileai.dev/api/auto_eval_job/get_results?id=cm0c4bxwo002kpe01h5j4zj2y
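
The two URL patterns above can be assembled from a job_id as follows. This is a sketch under assumptions: job_urls is a hypothetical helper that is not part of lastmile_auto_eval, the default host is copied from the examples above, and any authentication the endpoints may require is not covered.

```python
from urllib.parse import urlencode

# Host taken from the examples above; replace it if your deployment differs.
DEFAULT_HOST_URL = "https://lastmileai.dev"


def job_urls(job_id: str, host_url: str = DEFAULT_HOST_URL) -> dict[str, str]:
    """Hypothetical helper: build the two job endpoints listed above."""
    base = host_url.rstrip("/") + "/api/auto_eval_job"
    query = urlencode({"id": job_id})  # escapes the id if it ever needs it
    return {
        "read": f"{base}/read?{query}",
        "get_results": f"{base}/get_results?{query}",
    }


urls = job_urls("cm0c4bxwo002kpe01h5j4zj2y")
print(urls["read"])
print(urls["get_results"])
```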

Example Usage

from lastmile_auto_eval import (
    EvaluationMetric,
    EvaluationResult,
    evaluate,
    stream_evaluate,
    submit_job,
)
import pandas as pd
import json
from typing import Any, Generator

queries = ["what color is the sky?", "what color is the sky?"]
correct_answer = "the sky is blue"
incorrect_answer = "the sky is red"
ground_truth_values = [correct_answer, correct_answer]
responses = [correct_answer, incorrect_answer]

df = pd.DataFrame(
    {
        "input": queries,
        "ground_truth": ground_truth_values,
        "output": responses,
    }
)

# Non-streaming
result: EvaluationResult = evaluate(
    dataframe=df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(json.dumps(result, indent=2))

# Response will look something like this:
"""
{
  "p_faithful": [
    0.999255359172821,
    0.00011296303273411468
  ],
  "summarization": [
    0.9995583891868591,
    6.86283819959499e-05
  ]
}
"""

# Response-streaming
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))

# Bidirectional-streaming
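def gen_df_stream(
    inputs: list[str], gt: list[str], outputs: list[str]
) -> Generator[pd.DataFrame, None, None]:
    # Yield one single-row dataframe per example. The parameters are named
    # inputs/outputs so they do not shadow the built-in input().
    for query, truth, response in zip(inputs, gt, outputs):
        yield pd.DataFrame(
            {
                "input": [query],
                "ground_truth": [truth],
                "output": [response],
            }
        )

df_iterator = gen_df_stream(
    inputs=queries, gt=ground_truth_values, outputs=responses
)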
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df_iterator,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))

# Async request
job_id = submit_job(
    df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(f"{job_id=}")
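
The non-streaming response shown in the example maps each metric name to one score per dataframe row. A minimal sketch for picking out low-scoring rows from a result of that shape follows. low_scoring_rows is a hypothetical helper, not part of lastmile_auto_eval, and the 0.5 threshold is an arbitrary assumption rather than a recommended cutoff.

```python
from typing import Mapping, Sequence

# Sample scores copied from the example response above.
sample_result = {
    "p_faithful": [0.999255359172821, 0.00011296303273411468],
    "summarization": [0.9995583891868591, 6.86283819959499e-05],
}


def low_scoring_rows(
    result: Mapping[str, Sequence[float]], threshold: float = 0.5
) -> dict[str, list[int]]:
    """Hypothetical helper: row indices whose score is below the threshold."""
    return {
        metric: [row for row, score in enumerate(scores) if score < threshold]
        for metric, scores in result.items()
    }


print(low_scoring_rows(sample_result))
# prints {'p_faithful': [1], 'summarization': [1]}
```

Row 1 is the "the sky is red" response, which scores low on both metrics in the sample.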

Download files

Source Distribution

lastmile_auto_eval-0.0.6.tar.gz (10.9 kB, source)

Built Distribution

lastmile_auto_eval-0.0.6-py3-none-any.whl (11.8 kB, Python 3)

File details

Details for the file lastmile_auto_eval-0.0.6.tar.gz.

File metadata

  • Size: 10.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.6

File hashes

Hashes for lastmile_auto_eval-0.0.6.tar.gz
Algorithm    Hash digest
SHA256       90a8a61d6a199a3d98d6a2f836a922ffbc64d1205b55a2fbf7b5fdd9f06e57af
MD5          81bd9a26974c932d0a29694867187748
BLAKE2b-256  371f4992bf0cde14b9bf1b075ae4b0c334e2e5030b5ba58096f3015c1d73fe7a

File details

Details for the file lastmile_auto_eval-0.0.6-py3-none-any.whl.

File hashes

Hashes for lastmile_auto_eval-0.0.6-py3-none-any.whl
Algorithm    Hash digest
SHA256       b07426a034637e1d14bbdd26eb4aa01a0e810d21e058b5dd38abef98f1f48409
MD5          2275263a90146099f5da8faa96bbfcff
BLAKE2b-256  a80b01a1b3fadffb0f27a6ecc507ddc2a62bba6ae0203a6e78686116d509d62a
