
Library that makes it easier to evaluate the quality of LLM outputs

Project description

LaPET Overview

Public LLM leaderboards, such as those hosted on Hugging Face, are great for getting a general idea of which models perform well. However, they are not much help when you need to evaluate models for a specific generative task. The LMSYS Chatbot Arena provides interesting results but is likewise too general.

LaPET stands for Language Pairwise Evaluation Toolkit. It is aimed at users who need to know how well a model will work for a specific task, such as summarizing a customer service call, putting together an action plan to resolve a customer issue, or analyzing a spreadsheet for inconsistencies. These real-world tasks require an evaluation method that is easy to use for any kind of user, whether you want to create your own LLM benchmark or use data from ours.

The purpose of this library is to make it easier to evaluate the quality of LLM outputs from multiple models across a set of user-selectable tasks. Outputs are evaluated with an LLM-as-a-judge approach (GPT-4o).

How it Works

LaPET performs a pairwise preference evaluation for every possible pair of LLM outputs. Users define a set of prompts for generation, the number of samples they would like to use, and which (supported) models they would like to evaluate. We randomize the order in which the two model outputs are presented (first or second) to reduce the chance of positional bias, and we try to eliminate any extra language that might skew preference based on output length. Both the LLM outputs and the LLM-as-a-judge evaluations are stored in CSV files for further analysis.
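To make the pairing and randomization concrete, here is a minimal sketch in plain Python. It is illustrative only and does not reflect LaPET's internal implementation; the model names and outputs are placeholders.

```python
# Minimal sketch of pairwise scheduling with randomized presentation order.
# Illustrative only; not LaPET's internal code.
import itertools
import random

model_outputs = {
    "llama3_8b_instruct": "output A ...",
    "phi_3": "output B ...",
    "zephyr_7b_beta": "output C ...",
}

pairs = []
for model_a, model_b in itertools.combinations(model_outputs, 2):
    # Flip a coin to decide which output the judge sees first,
    # reducing the chance of positional bias.
    if random.random() < 0.5:
        model_a, model_b = model_b, model_a
    pairs.append((model_a, model_b))

for first, second in pairs:
    print(f"Judge compares {first} (shown first) vs {second} (shown second)")
```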

Requirements

The current version of LaPET requires access to GPUs on a server; alternatively, you can use the Google Colab template, which works if you have a Google Colab Pro+ account. You will also need a Hugging Face account to download models and an OpenAI account to use the LLM judge.
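For example, Hugging Face authentication can be handled once up front. The token value below is a placeholder.

```python
# Hedged example: authenticate with the Hugging Face Hub so gated models
# (e.g. the Llama checkpoints) can be downloaded. The token is a placeholder.
from huggingface_hub import login

login(token="hf_your_token_here")  # or set the HF_TOKEN environment variable
```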

Supported Models

We plan to add more models as time allows and on request. The library currently supports outputs generated by the following models (see the sketch after this list):

  • llama2_7b_chat
  • llama3_8b_instruct
  • phi_3
  • zephyr_7b_beta
  • gemma_7b
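These identifiers correspond to open checkpoints on the Hugging Face Hub. As a rough illustration of how one of them can produce a candidate output with the transformers library; the Hub repo ID and generation settings below are assumptions and may differ from what generate.py actually uses.

```python
# Hedged sketch: generating one candidate output with Hugging Face transformers.
# The repo ID and generation parameters are illustrative; generate.py may differ.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # requires an accepted license and HF token
    device_map="auto",
)

prompt = "Summarize the following customer service call:\n..."
result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```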

We use GPT-4o as the LLM evaluator (judge), which picks a winner from each pair of LLM-generated outputs.
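A single judging step can be pictured as one chat-completion call with the OpenAI Python SDK. The prompt wording below is illustrative, not LaPET's exact judge prompt.

```python
# Hedged sketch of an LLM-as-a-judge call with the OpenAI Python SDK.
# The judge prompt wording is illustrative, not LaPET's exact prompt.
from openai import OpenAI

client = OpenAI()  # credentials are read from the environment

def judge(task_prompt: str, output_1: str, output_2: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": "You compare two responses and pick the better one."},
            {
                "role": "user",
                "content": (
                    f"Task prompt:\n{task_prompt}\n\n"
                    f"Response 1:\n{output_1}\n\n"
                    f"Response 2:\n{output_2}\n\n"
                    "Which response is better? Answer with '1' or '2' only."
                ),
            },
        ],
    )
    return response.choices[0].message.content
```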

Getting Started

You will need an NVIDIA A100 or H100 with at least 40 GB of GPU memory to run LaPET locally. Alternatively, you can use the Google Colab template if you have a Google Colab Pro+ account (select the A100 runtime).

  • Edit generate.py as needed. You can change which models you want to evaluate and adjust the global model parameters like temperature and max_length. You can also change the prompts to suit the tasks you want to evaluate and set how many output samples you would like to generate.
  • Run generate.py (you will need your Hugging Face User Access Token and a local GPU with 40 GB of memory; we have tested on NVIDIA A100s and H100s).
  • This will generate a set of responses for each model for each prompt and store them in eval_data.csv. These are the model outputs that will be evaluated by the LLM judge (GPT-4o).
  • Run evaluate.py. You will need your OpenAI environment variables set up to run this script: OPENAI_ORG, OPENAI_PROJECT, and OPENAI_KEY (see the sketch after this list).
  • This will generate the evaluation results for each pairwise evaluation in eval_results.csv.
  • We have provided a Jupyter notebook, Evaluation_Results.ipynb, that creates a preference graph for each model.
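As a quick way to set the required environment variables and take a first look at the results, here is a hedged sketch. The credential values are placeholders, and the eval_results.csv column name (`winner`) is an assumption for illustration; check it against the actual CSV header produced by evaluate.py.

```python
# Hedged sketch: set the credentials evaluate.py expects and inspect the
# pairwise results. The values and the "winner" column name are assumptions.
import os
import pandas as pd

os.environ["OPENAI_ORG"] = "org-..."       # placeholder
os.environ["OPENAI_PROJECT"] = "proj_..."  # placeholder
os.environ["OPENAI_KEY"] = "sk-..."        # placeholder

results = pd.read_csv("eval_results.csv")
print(results.head())

# If the results include a column naming the preferred model, a simple win count:
if "winner" in results.columns:
    print(results["winner"].value_counts())
```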

Limitations

  • We do not yet evaluate for accuracy.
  • The prompts are global until we support model-level prompts. This might affect output quality, since each LLM is more or less sensitive to different prompt strategies.
  • We randomly select conversations from a large synthetic dataset, which causes the results to vary from one run to another.
  • LaPET is currently configured to work with only one kind of dataset (conversations) until we add support for others.

Planned Features

  • Create task templates for testing different kinds of task groups beyond conversation tasks
  • Add the ability to select more than one judge, including a human evaluator
  • Add the ability to use custom prompts for each model
  • Make the default prompts more robust across all models
  • Create a smaller refined dataset
  • Add flash attention for Phi-3
  • Create a Gemma subclass to handle lack of chat template in tokenizer
  • Prompt optimizer to automatically recommend and test different prompt strategies for a task / model
  • Add performance metrics (memory, tokens/second)
  • Add option to utilize cloud LLM endpoints like Groq, Cloudflare, etc.
  • Add commercial models like Cohere, Anthropic, Google, Mistral
