An API for using metric models (either provided by default or fine-tuned yourself) to evaluate LLMs.

Project description

A library for using models (either default ones provided by LastMile or your own that are fine-tuned) to evaluate LLMs.

Evaluations are run on dataframes that include any combination of the input, ground_truth, and output columns. At least one of these columns must be present, and all values must be strings.
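For instance (a sketch independent of the evaluation API itself), a dataframe that defines only the output column is valid input, as long as every value is a string:

```python
import pandas as pd

# A dataframe with only the "output" column; "input" and "ground_truth"
# are omitted, which is allowed since at least one column is present.
df_outputs_only = pd.DataFrame(
    {"output": ["the sky is blue", "the sky is red"]}
)

# Every value must be a string.
assert all(isinstance(v, str) for v in df_outputs_only["output"])
```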

Example usage:

from lastmile_auto_eval import (
    EvaluationMetric,
    EvaluationResult,
    evaluate,
    stream_evaluate,
)
import pandas as pd
import json
from typing import Any, Generator

queries = ["what color is the sky?", "what color is the sky?"]
statement_1 = "the sky is red"
statement_2 = "the sky is blue"
ground_truth_values = [statement_1, statement_1]
responses = [statement_1, statement_2]

df = pd.DataFrame(
    {
        "input": queries,
        "ground_truth": ground_truth_values,
        "output": responses,
    }
)

# Non-streaming
result: EvaluationResult = evaluate(
    dataframe=df,
    metrics=[
        EvaluationMetric.P_FAITHFUL,
        EvaluationMetric.SUMMARIZATION,
    ],
)
print(json.dumps(result, indent=2))

# Response will look something like this:
"""
{
  "p_faithful": [
    0.999255359172821,
    0.00011296303273411468
  ],
  "summarization": [
    0.9995583891868591,
    6.86283819959499e-05
  ]
}
"""

# Response-streaming
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))

# Bidirectional streaming: stream_evaluate also accepts an iterator of
# dataframes. gen_df_stream is a user-defined generator (not part of the
# library) that yields pd.DataFrame chunks.
df_iterator = gen_df_stream(
    input=queries, gt=ground_truth_values, output=responses
)
result_iterator: Generator[EvaluationResult, Any, Any] = (
    stream_evaluate(
        dataframe=df_iterator,
        metrics=[
            EvaluationMetric.P_FAITHFUL,
            EvaluationMetric.SUMMARIZATION,
        ],
    )
)
for result_chunk in result_iterator:
    print(json.dumps(result_chunk, indent=2))
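The bidirectional example above relies on a gen_df_stream helper that is not defined in the snippet and is not part of the library's API. A minimal sketch of such a generator, yielding one single-row dataframe per example, could look like this:

```python
from typing import Generator
import pandas as pd

def gen_df_stream(
    input: list[str], gt: list[str], output: list[str]
) -> Generator[pd.DataFrame, None, None]:
    """Yield one single-row DataFrame per (input, ground_truth, output) triple."""
    for i, g, o in zip(input, gt, output):
        yield pd.DataFrame(
            {"input": [i], "ground_truth": [g], "output": [o]}
        )
```

Any generator that yields dataframes with the expected string columns would work; yielding one row at a time simply makes each chunk available to the evaluator as soon as it is produced.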

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lastmile_auto_eval-0.0.2.tar.gz (9.3 kB)

Uploaded Source

Built Distribution

lastmile_auto_eval-0.0.2-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file lastmile_auto_eval-0.0.2.tar.gz.

File metadata

  • Download URL: lastmile_auto_eval-0.0.2.tar.gz
  • Upload date:
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.6

File hashes

Hashes for lastmile_auto_eval-0.0.2.tar.gz

  • SHA256: 71da9e6896dfeaa6096929fa8205b12d7667e6f5a2158ff8799c58303dc5410b
  • MD5: bb2b56f2f47e90446efff50a578ab775
  • BLAKE2b-256: 4a2a826e1fbdd8301157912d927b11ddfcf5c8cdc372e5e80bdcea16294031e8

See more details on using hashes here.

File details

Details for the file lastmile_auto_eval-0.0.2-py3-none-any.whl.

File hashes

Hashes for lastmile_auto_eval-0.0.2-py3-none-any.whl

  • SHA256: 4e2e642edc64e44e8fd8dfe87c02d9c1cfe8b354139f119f119cfbfb13c771ee
  • MD5: 40bd77216314a937f3da5d5e1016296e
  • BLAKE2b-256: 652ac02cf480fdea627aab4e7aa75e50d53edd4d95dafcdad2e27748d12c7cc9

See more details on using hashes here.
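To check a downloaded artifact against the digests listed above, a short sketch using Python's standard hashlib module (the expected value shown is the SHA256 digest published above for the sdist):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: compare against the published digest.
# expected = "71da9e6896dfeaa6096929fa8205b12d7667e6f5a2158ff8799c58303dc5410b"
# assert sha256_of_file("lastmile_auto_eval-0.0.2.tar.gz") == expected
```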
