
LARES: vaLidation, evAluation and REliability Solutions

Project description

LARES: vaLidation, evAluation, and faiRnEss aSsessments


A Python package designed to assist with the evaluation and validation of models on tasks such as translation, summarization, and rephrasing.

This package leverages a suite of existing tools and resources to provide appropriate evaluation and validation for the prompted task. The Natural Language Toolkit (NLTK), BERT, and ROUGE are employed for evaluation, while Microsoft's Fairlearn, Facebook's BART, and RoBERTa are used to assess and address the toxicity and fairness of a given model.

In addition, LARES uses datasets from HuggingFace, where the choice of datasets was informed by benchmark setters such as the General Language Understanding Evaluation (GLUE) benchmark.

Features

  • Quantitative and Qualitative Evaluation: Provides both quantitative and qualitative approaches to evaluating models. Quantitative metrics include METEOR scores for translation, normalized ROUGE scores for summarization, and BERT scores for rephrasing tasks. Qualitative metrics are computed from binary user judgements and from sentiment analysis of user feedback.

  • Fairness and Toxicity Validation: Provides a quantitative measure of the toxicity and fairness of a given model for specific tasks by leveraging Fairlearn and RoBERTa.

  • Iterative Reconstruction: Uses BART to iteratively rephrase model responses until they fall below a specified toxicity threshold and above a specified quality threshold.
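
As a rough illustration of the quantitative metrics named above, the snippet below shows how METEOR, ROUGE, and BERTScore can be computed with the nltk, rouge, and bert_score packages listed under Dependencies. This is a sketch of the general approach, not LARES's internal implementation.

# Sketch: computing the quantitative metrics named above with the packages
# listed under Dependencies. Illustrative only; not LARES's internal code.
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge import Rouge
from bert_score import score as bert_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# METEOR (translation): expects tokenized references and hypothesis
meteor = meteor_score([reference.split()], candidate.split())

# ROUGE (summarization): get_scores returns rouge-1/2/l with f/p/r values
rouge_l = Rouge().get_scores(candidate, reference)[0]["rouge-l"]["f"]

# BERTScore (rephrasing): returns precision, recall, and F1 tensors
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(meteor, rouge_l, f1.item())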

Workflow

Prompt from Dataset

Start with a dataset and create a set of prompts and references to evaluate the model. The dataset can be a benchmark dataset obtained from sources such as HuggingFace, or it can be real-time data that has been scraped.

Task Determination/Labeling

Each task is classified according to its underlying purpose, such as translation, summarization, rephrasing, sentiment analysis, or classification. This classification provides two key benefits:

  1. Model Selection: Understanding the task helps us choose the best model for it, improving the overall performance of our framework.
  2. Response Evaluation: Different tasks require different evaluation metrics. By classifying our tasks, we can use the most appropriate metrics to evaluate the responses.

The datasets are labeled (by the user) based on potential differences. For instance, English-to-French prompts might be labeled 'fr', while English-to-Spanish prompts could be labeled 'es'. This helps us identify potential biases in the model.
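
A labeling scheme like the following (hypothetical values, shown only to make the idea concrete) would let the later fairness step compare translation quality across languages:

# Hypothetical labeling: one label per prompt, marking the sub-population
# it belongs to so scores can later be compared across groups.
prompts = [
    "Translate to French: Good morning.",
    "Translate to Spanish: Good morning.",
]
labels = ["fr", "es"]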

Output Generation from Model

The prompt is passed to a model, which generates a response.
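With the openai version pinned under Dependencies (0.27.8), this generation step might look like the sketch below; the model name is an assumption, not something LARES fixes.

import openai

# Sketch of the generation step using the legacy ChatCompletion API from
# openai==0.27.8. The model name here is an assumption.
openai.api_key = "YOUR_API_KEY"
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Translate to fr: Good morning."}],
)
response = completion["choices"][0]["message"]["content"]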

Evaluation According to Task Label and Validation

The evaluation score is calculated by comparing the model's response to a reference using a task-specific metric. The validation score is calculated by using a pre-trained model to determine the sentiment of the response and assign a toxicity/profanity metric. If the user chooses not to use the optional Rephrase/Detox loop, the scores and response are added to an output dictionary.
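
The toxicity side of this validation can be approximated with a RoBERTa-based classifier through the transformers pipeline, as in the sketch below; the specific checkpoint is an assumption and may differ from the one LARES uses.

from transformers import pipeline

# Sketch: scoring toxicity with a publicly available RoBERTa-based
# classifier. The checkpoint name is an assumption, not LARES's exact choice.
toxicity_clf = pipeline(
    "text-classification",
    model="s-nlp/roberta_toxicity_classifier",
)
result = toxicity_clf("That was a thoughtful and helpful answer.")[0]
print(result["label"], result["score"])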

(OPTIONAL) Check Against Threshold, Check Num. Iterations, Rephrase/Detox, and Optional User Evaluation

The user can set a threshold for the validity and evaluation scores.

  1. If both scores exceed their respective thresholds, they, along with the response, are added to the output dictionary.

  2. If either score fails to meet its threshold, we enter an iterative loop of rephrasing and detoxifying. The user can set a maximum number of iterations for this process (a rough sketch of the loop follows this list).

    A. The response will be rephrased and/or detoxified until it meets the threshold or until the maximum number of iterations is reached.

    B. If both scores exceed their thresholds, they, along with the response, are added to the output dictionary.

    C. If we reach the maximum number of iterations without exceeding both thresholds, the user is asked to review the results. This provides the opportunity to catch potential nuances in responses without relying solely on manual efforts. This step is optional. If the user participates, their evaluation is added to the output dictionary. If not, the scores from the final iteration are added.
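
Put together, the optional loop described above behaves roughly like the sketch below. The helpers evaluate, toxicity, and rephrase are hypothetical callables standing in for the task metric, the RoBERTa toxicity score, and the BART rephraser; passing them in keeps the sketch self-contained.

# Rough sketch of the rephrase/detox loop. `evaluate`, `toxicity`, and
# `rephrase` are hypothetical callables standing in for the task metric,
# the RoBERTa toxicity score, and the BART rephraser.
def detox_loop(response, reference, evaluate, toxicity, rephrase,
               eval_threshold=0.5, tox_threshold=0.5, max_iterations=3):
    eval_score = evaluate(response, reference)
    tox_score = toxicity(response)
    for _ in range(max_iterations):
        # Stop as soon as quality is above and toxicity is below threshold
        if eval_score >= eval_threshold and tox_score <= tox_threshold:
            break
        response = rephrase(response)  # BART-based rephrase/detox pass
        eval_score = evaluate(response, reference)
        tox_score = toxicity(response)
    return response, eval_score, tox_score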

Fairness

At this point, we have a set of labeled responses and their corresponding validation and evaluation scores. These labels and scores allow us to identify potential biases in the model. We provide the user with the responses, the average validation and evaluation scores for each labeled set, and an overall measure of the model's fairness.
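
As a minimal sketch of this grouping (assuming simple per-label averaging and a max-minus-min gap as the bias indicator; LARES's actual fairness computation may differ), the idea is:

import numpy as np

# Sketch: average evaluation scores per label and a simple gap between the
# best and worst groups as a rough bias indicator. Values are made up.
labels = np.array([0, 0, 1, 1])            # e.g. 0 = French set, 1 = Spanish set
eval_scores = np.array([0.82, 0.78, 0.65, 0.70])

per_label_mean = {int(lab): eval_scores[labels == lab].mean() for lab in np.unique(labels)}
bias_gap = max(per_label_mean.values()) - min(per_label_mean.values())
print(per_label_mean, bias_gap)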

Installation

Requires Python 3.6 or later. You can install using pip via:

pip install lares

Usage

Here is a basic usage example for a translation task:

# Imports
import openai
from datasets import load_dataset
from lares import generate
import numpy as np

# Set your API key
openai.api_key = ''

# Loader
def load_translation_data(dataset_name, language_pair, num_samples=10):
    # Grab data
    dataset = load_dataset(dataset_name, language_pair)
    data = dataset["validation"]['translation'][:num_samples]

    # Create the prompts
    prompts = [f'Translate to {language_pair.split("-")[1]}: {item["en"]}' for item in data]
    # Get the references (correct translations)
    references = [item[language_pair.split("-")[1]] for item in data]
    # Return prompts and references
    return prompts, references

# Load the translation data
prompts_fr, refs_fr = load_translation_data("opus100", "en-fr")
prompts_es, refs_es = load_translation_data("opus100", "en-es")

# Combine the prompts and references
prompts = prompts_fr + prompts_es
references = refs_fr + refs_es
# Create labels for the data (0 for French, 1 for Spanish)
labels = np.concatenate([np.zeros(len(prompts_fr)), np.ones(len(prompts_es))]).tolist()

# Use the generate function from the LARES module to get the model's metrics for this task
data, bias, acc, tox = generate(prompts, references, labels, max_iterations=1, task_type='Translation', feedback=False)

# Print the results
print(f"Bias: {bias}")
print(f"Accuracy: {acc[0]} (Set 1), {acc[1]} (Set 2)")
print(f"Toxicity: {tox[0]} (Set 1), {tox[1]} (Set 2)")

Dependencies

  • openai==0.27.8
  • nltk==3.7
  • torch==2.0.1
  • transformers==4.31.0
  • rouge==1.0.1
  • bert_score==0.3.12
  • datasets==1.11.0

To install these dependencies explicitly, run:

pip install openai==0.27.8 nltk==3.7 torch==2.0.1 transformers==4.31.0 rouge==1.0.1 bert_score==0.3.12 datasets==1.11.0

Installing LARES via pip should, however, pull in these underlying dependencies automatically.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lares-0.0.31.tar.gz (8.4 kB)

Uploaded Source

Built Distribution

lares-0.0.31-py3-none-any.whl (8.9 kB)

Uploaded Python 3

File details

Details for the file lares-0.0.31.tar.gz.

File metadata

  • Download URL: lares-0.0.31.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.12

File hashes

Hashes for lares-0.0.31.tar.gz
  • SHA256: 16df22874633b1743a477748f208fc64d704541d30ee1e489360d26e91801364
  • MD5: c9c25bf3a12bd2a5e1433344dbfbf0ee
  • BLAKE2b-256: e9c938fde4aa8a816861505a38ef2b1dd62b3273b449f4556692ff1a7d78532f

See more details on using hashes here.

File details

Details for the file lares-0.0.31-py3-none-any.whl.

File metadata

  • Download URL: lares-0.0.31-py3-none-any.whl
  • Upload date:
  • Size: 8.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.12

File hashes

Hashes for lares-0.0.31-py3-none-any.whl
  • SHA256: ec00d0cb83a2cd90b2b7e7db89f8fd3d6688bd548386e698f71591e70100800f
  • MD5: 5448f8e447eac31a0d7d1f980e5572eb
  • BLAKE2b-256: 51ae84b4378d1a3a596baf482862c3e7fa2ef6a36a71ae320becebfea913f15b

See more details on using hashes here.
