
An API to measure evaluation criteria (ex: faithfulness) of generative AI outputs


LastMile AI Eval

A library of tools to evaluate your RAG system.

Setup

  1. Get a LastMile API token (see the section below)
  2. Install this library: pip install lastmile-eval
  3. Gather the data you want to evaluate
  4. Usage: see the quick sketch and full examples below.
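
A minimal sketch of the full flow, assuming LASTMILE_API_TOKEN is already available as an environment variable (token setup is covered in the next section):

import os

from lastmile_eval.rag import get_rag_eval_scores

# Assumes LASTMILE_API_TOKEN is set in the environment (see the token section below).
api_token = os.environ["LASTMILE_API_TOKEN"]

queries = ["what color is the sky?"]
data = ["the sky is blue"]        # retrieved context passed to the LLM
responses = ["the sky is blue"]   # the LLM's output

scores = get_rag_eval_scores(queries, data, responses, api_token)
print(scores)  # e.g. {'p_faithful': [...]}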

LastMile API token

To get a LastMile AI token, go to the LastMile tokens webpage. You can create an account with Google or GitHub and then click "Create new token" in the "API Tokens" section. Once a token is created, be sure to save it somewhere safe, since you won't be able to view its value on the website again (though you can create a new one if that happens).

Please be careful not to share your token on GitHub. Instead, we recommend saving it in your project's (or home directory's) .env file as LASTMILE_API_TOKEN=<TOKEN_HERE> and loading it with dotenv.load_dotenv(). See the examples/ folder for how to do this.

LLM Provider Tokens (.env file)

To use the LLM-based evaluators, also add your other provider API tokens to your .env file. Example: OPENAI_API_KEY=<TOKEN_HERE>
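
For example, a .env file at the project root might look like this (placeholder values shown):

LASTMILE_API_TOKEN=<TOKEN_HERE>
OPENAI_API_KEY=<TOKEN_HERE>

Both example scripts below load it the same way, via python-dotenv:

import os

import dotenv

dotenv.load_dotenv()  # reads .env into environment variables
api_token = os.getenv("LASTMILE_API_TOKEN")
openai_key = os.getenv("OPENAI_API_KEY")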

Examples

Example 1: RAG Evaluation Script

"""The RAG evaluation API runs evaluation criteria (ex: faithfulness) 
    of generative AI outputs from a RAG system.

Particularly, we evaluate based on this triplet of information:

1. User query
2. Data that goes into the LLM
3. LLM's output response

The  `get_rag_eval_scores()` function returns a faithfulness score from 0 to 1.
"""

import os
import sys
from textwrap import dedent

import dotenv
import pandas as pd

from lastmile_eval.rag import get_rag_eval_scores


def main():
    rag_scores_example_1()
    rag_scores_example_2()
    return 0


def get_lastmile_api_token():
    """See README.md for mor information."""
    dotenv.load_dotenv()
    api_token = os.getenv("LASTMILE_API_TOKEN")
    assert api_token is not None
    return api_token


def rag_scores_example_1():
    print("\n\nRAG scores example 1:")
    statement1 = "the sky is red"
    statement2 = "the sky is blue"

    queries = ["what color is the sky?", "is the sky blue?"]
    data = [statement1, statement1]
    responses = [statement1, statement2]
    api_token = get_lastmile_api_token()
    result = get_rag_eval_scores(
        queries,
        data,
        responses,
        api_token,
    )

    print("Result: ", result)

    # result will look something like:
    # {'p_faithful': [0.9955534338951111, 6.857347034383565e-05]}


def rag_scores_example_2():
    print("\n\nRAG scores example 2:")
    questions = ["what is the ultimate purpose of the endpoint?"] * 2

    data1 = """
    Server-side, we will need to expose a prompt_schemas endpoint 
    which provides the mapping of model name → prompt schema 
    which we will use for rendering prompt input/settings/metadata on the client
    """

    data = [data1] * 2

    responses = ["""client rendering""", """metadata mapping"""]

    # f"{data1}. Query: {questions[0]}",
    # f"{data1}. Query: {questions[1]}",

    print(f"Input batch:")
    df = pd.DataFrame(
        {"question": questions, "data": data, "response": responses}
    )
    print(df)

    api_token = get_lastmile_api_token()

    result_dict = get_rag_eval_scores(
        questions,
        data,
        responses,
        api_token,
    )

    df["p_faithful"] = result_dict["p_faithful"]

    print(
        dedent(
            """
            Given a question and reference data (assumed to be factual), 
            the faithfulness score estimates whether 
            the response correctly answers the question according to the given data.
            """
        )
    )
    print("Dataframe with scores:")
    print(df)


if __name__ == "__main__":
    sys.exit(main())
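
The p_faithful values are probabilities between 0 and 1, so a common follow-up is to threshold them into pass/fail labels. A minimal sketch using the sample output above; the 0.5 cutoff is an arbitrary illustrative choice, not a library default:

# Illustrative post-processing of get_rag_eval_scores() output.
result = {"p_faithful": [0.9955534338951111, 6.857347034383565e-05]}

THRESHOLD = 0.5  # example threshold, tune for your use case
labels = [
    "faithful" if p >= THRESHOLD else "unfaithful"
    for p in result["p_faithful"]
]
print(labels)  # ['faithful', 'unfaithful']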

Example 2: General Text Evaluators Script

"""The text module provides more general evaluation functions
for text generated by AI models."""

import sys

import dotenv

import lastmile_eval.text as lm_eval_text


def main():
    # OpenAI evaluators require an OpenAI API key in the .env file.
    # See README.md for more information about `.env`.
    dotenv.load_dotenv()

    SUPPORTED_BACKING_LLMS = [
        "gpt-3.5-turbo",
        "gpt-4",
    ]

    print("Starting text evaluation examples.")

    for model_name in SUPPORTED_BACKING_LLMS:
        print(
            f"\n\n\n\nRunning example evaluators with backing LLM {model_name}"
        )
        text_scores_example_1(model_name)
        text_scores_example_2(model_name)
        text_scores_example_3(model_name)

    return 0


def text_scores_example_1(model_name: str):
    texts_to_evaluate = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog.",
    ]
    references = [
        "The quick brown fox jumps over the lazy dog.",
        "The swift brown fox leaps over the lazy dog.",
    ]
    bleu = lm_eval_text.calculate_bleu_score(texts_to_evaluate, references)
    print("\n\nTexts to evaluate: ", texts_to_evaluate)
    print("References: ", references)
    print("\nBLEU scores: ", bleu)

    rouge1 = lm_eval_text.calculate_rouge1_score(texts_to_evaluate, references)
    print("\nROUGE1 scores: ", rouge1)

    exact_match = lm_eval_text.calculate_exact_match_score(
        texts_to_evaluate, references
    )

    print("\nExact match scores: ", exact_match)

    relevance = lm_eval_text.calculate_relevance_score(
        texts_to_evaluate, references, model_name=model_name
    )

    print("\nRelevance scores: ", relevance)

    summarization = lm_eval_text.calculate_summarization_score(
        texts_to_evaluate, references, model_name=model_name
    )

    print("\nSummarization scores: ", summarization)

    custom_semantic_similarity = (
        lm_eval_text.calculate_custom_llm_metric_example_semantic_similarity(
            texts_to_evaluate, references, model_name=model_name
        )
    )

    print("\nCustom semantic similarity scores: ", custom_semantic_similarity)


def text_scores_example_2(model_name: str):
    texts_to_evaluate = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog.",
    ]
    references = [
        "The quick brown fox jumps over the lazy dog.",
        "The swift brown fox leaps over the lazy dog.",
    ]

    questions = ["What does the animal do", "Describe the fox"]

    qa = lm_eval_text.calculate_qa_score(
        texts_to_evaluate, references, questions, model_name=model_name
    )
    print("\n\nTexts to evaluate: ", texts_to_evaluate)
    print("References: ", references)
    print("\nQA scores: ", qa)

    human_vs_ai = lm_eval_text.calculate_human_vs_ai_score(
        texts_to_evaluate, references, questions, model_name=model_name
    )

    print("\nHuman vs AI scores: ", human_vs_ai)


def text_scores_example_3(model_name: str):
    texts_to_evaluate = [
        "I am happy",
        "I am sad",
    ]

    toxicity = lm_eval_text.calculate_toxicity_score(
        texts_to_evaluate, model_name=model_name
    )
    print("\nToxicity scores: ", toxicity)

    custom_sentiment = (
        lm_eval_text.calculate_custom_llm_metric_example_sentiment(
            texts_to_evaluate, model_name=model_name
        )
    )

    print("\nCustom sentiment scores: ", custom_sentiment)


if __name__ == "__main__":
    sys.exit(main())
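
As in the RAG example, you can collect the text evaluator outputs into a pandas DataFrame for side-by-side comparison. A minimal sketch, reusing the inputs from text_scores_example_1 and assuming each evaluator returns one score per input text (as the plural "scores" in the printed output suggests):

import pandas as pd

import lastmile_eval.text as lm_eval_text

texts_to_evaluate = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
]
references = [
    "The quick brown fox jumps over the lazy dog.",
    "The swift brown fox leaps over the lazy dog.",
]

# Reference-based metrics from the example above (these do not take a model_name).
df = pd.DataFrame(
    {
        "text": texts_to_evaluate,
        "reference": references,
        "bleu": lm_eval_text.calculate_bleu_score(texts_to_evaluate, references),
        "rouge1": lm_eval_text.calculate_rouge1_score(texts_to_evaluate, references),
        "exact_match": lm_eval_text.calculate_exact_match_score(texts_to_evaluate, references),
    }
)
print(df)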
