An API for using metric models (either provided by default or fine-tuned yourself) to evaluate LLMs.

Project description

A library for evaluating LLMs with metric models, either the defaults provided by LastMile or models you have fine-tuned yourself.

Evaluations are run on dataframes that include any combination of input, ground_truth, and output columns. At least one of these columns must be present, and all values must be strings.
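The column rules above can be sketched with plain pandas (the dataframe names here are illustrative, not part of the library's API):

```python
import pandas as pd

# Valid: at least one of "input", "ground_truth", "output" is present.
df_output_only = pd.DataFrame({"output": ["the sky is blue"]})

# All values must be strings, so coerce non-string cells up front
# (e.g. numeric answers read from a CSV).
df_mixed = pd.DataFrame({"input": ["what is 2 + 2?"], "output": [4]})
df_mixed = df_mixed.astype(str)
```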

Example usage:

from lastmile_auto_eval import (
    EvaluationMetric,
    EvaluationResult,
    evaluate,
    stream_evaluate,
)
import pandas as pd
import json
from typing import Any, Generator

# Two example rows with the same query: the first response matches the
# ground truth, the second contradicts it.
queries = ["what color is the sky?", "what color is the sky?"]
statement_1 = "the sky is red"
statement_2 = "the sky is blue"
ground_truth_values = [statement_1, statement_1]
responses = [statement_1, statement_2]

df = pd.DataFrame(
    {
        "input": queries,
        "ground_truth": ground_truth_values,
        "output": responses,
    }
)

# Non-streaming
result: EvaluationResult = evaluate(
    dataframe=df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(json.dumps(result, indent=2))

# Response will look something like this:
"""
{
  "p_faithful": [
    0.999255359172821,
    0.00011296303273411468
  ],
  "summarization": [
    0.9995583891868591,
    6.86283819959499e-05
  ]
}
"""

# Response-streaming
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))

# Bidirectional-streaming
def gen_df_stream(inputs: list[str], gts: list[str], outputs: list[str]):
    # Yield one single-row dataframe per example so evaluation can begin
    # before the full dataset is assembled.
    for i in range(len(inputs)):
        df_chunk = pd.DataFrame(
            {
                "input": [inputs[i]],
                "ground_truth": [gts[i]],
                "output": [outputs[i]],
            }
        )
        yield df_chunk

df_iterator = gen_df_stream(
    inputs=queries, gts=ground_truth_values, outputs=responses
)
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df_iterator,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))
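Assuming each streamed chunk mirrors the non-streaming result shape (a mapping from metric name to a list of scores), the chunks can be merged into a single result once the stream is exhausted. This helper and its chunk values are a sketch, not part of the library:

```python
def merge_chunks(chunks: list[dict]) -> dict:
    """Merge result chunks into one dict, preserving row order across chunks."""
    merged: dict = {}
    for chunk in chunks:
        for metric, scores in chunk.items():
            merged.setdefault(metric, []).extend(scores)
    return merged

# Illustrative chunk values, shaped like the streamed output above.
example_chunks = [
    {"p_faithful": [0.999], "summarization": [0.9996]},
    {"p_faithful": [0.0001], "summarization": [0.00007]},
]
merged = merge_chunks(example_chunks)

# Apply a simple pass/fail threshold per row.
passed = [score >= 0.5 for score in merged["p_faithful"]]
```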

Download files

Download the file for your platform.

Source Distribution

lastmile_auto_eval-0.0.3.tar.gz (9.4 kB)

Uploaded Source

Built Distribution

lastmile_auto_eval-0.0.3-py3-none-any.whl (10.4 kB)

Uploaded Python 3

File details

Details for the file lastmile_auto_eval-0.0.3.tar.gz.

File metadata

  • Download URL: lastmile_auto_eval-0.0.3.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.6

File hashes

Hashes for lastmile_auto_eval-0.0.3.tar.gz:

  • SHA256: d8c82edaa2276fd55664ec54ffc1947822e393fc9ea1e038a017d78abef017f7
  • MD5: 63596e699986305be3fc376ebfbbde0b
  • BLAKE2b-256: e7f2b6a437cca6c37395c49c21e787f6f2aa3308bbe9e7b5b35f01c3e7df5718

File details

Details for the file lastmile_auto_eval-0.0.3-py3-none-any.whl.

File hashes

Hashes for lastmile_auto_eval-0.0.3-py3-none-any.whl:

  • SHA256: 467b1deead5780c7746225c8bc67e06dd8b9797a7bdbc5acc4c091e476492f36
  • MD5: 306ca16505a12eb0d1aba8fa64b0e2f2
  • BLAKE2b-256: 6c33ec1e796e46a423879e08b0e3be3406539cd2d8a9a42bbc2b4241a5652a5e
