Judy

Judy is a Python library and framework for evaluating the text-generation capabilities of Large Language Models (LLMs) using a judge LLM.

Judy allows users to evaluate LLMs using a competent judge LLM (such as GPT-4). Users can choose from a set of predefined scenarios sourced from recent research, or design their own. A scenario is a specific test designed to evaluate a particular aspect of an LLM. A scenario consists of:

  • Dataset: A source dataset from which prompts are generated to evaluate models.
  • Task: The task models are evaluated on. Tasks for judge evaluations have been carefully designed by researchers to assess specific aspects of LLMs.
  • Metric: The metric(s) used to evaluate the responses to a task, for example accuracy or level of detail.
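
As an illustration only (this is not Judy's actual API), a scenario bundling these three pieces could be modeled like this; the class and field names here are assumptions made for the sketch:

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Illustrative model of a scenario: a dataset, a task, and metrics.

    This mirrors the concepts above; it is not Judy's internal representation.
    """

    dataset: str  # source dataset used to generate prompts
    task: str  # what the models are asked to do
    metrics: list = field(default_factory=list)  # e.g. ["accuracy", "level of detail"]


example = Scenario(
    dataset="example-dataset",
    task="summarization",
    metrics=["accuracy", "level of detail"],
)
```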

Framework Overview

Judy is inspired by techniques from recent research, including HELM [1] and LLM-as-a-judge [2].


Installation

Use the package manager pip to install Judy. Note: Judy requires Python >= 3.10.

pip install judyeval

Alternate Installation

You can also install Judy directly from this git repo:

pip install git+https://github.com/TNT-Hoopsnake/judy

Getting Started

Setup configs

Judy uses three configuration files during evaluation. Only the run config is strictly required to get started:

  • Dataset Config: Defines every dataset available to an evaluation run, how to download it, and which class is used to format it. You only need to specify this config if you plan to add new datasets; otherwise Judy automatically uses the bundled example dataset config, unless you pass an alternative via --dataset-config.
  • Evaluation Config: Defines all of the tasks and the metrics used to evaluate them, and restricts which datasets and metrics each task may use. You only need to specify this config if you plan to add new tasks or metrics; otherwise Judy automatically uses the bundled example eval config, unless you pass an alternative via --eval-config.
  • Run Config: Defines all of the settings for your evaluation run. The evaluation results for your run store a copy of these settings (with sensitive details redacted) as metadata. An example run config is provided in the repository.
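
A run config is a YAML file. The sketch below is a loose illustration only; the key names are assumptions, not Judy's actual schema, so consult the bundled example run config for the real field names:

```yaml
# Hypothetical run config sketch -- real key names may differ.
name: disinfo-test
judge:
  api_type: OPENAI            # judge model served over the OpenAI API format
  model: gpt-4
models:
  - name: my-local-model
    api_type: OPENAI          # e.g. a LocalAI endpoint
    api_base: http://localhost:8080/v1
    api_key: ${MODEL_API_KEY} # sensitive values are redacted in stored run metadata
```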

Setup model(s) to evaluate

Ensure you have API access to the models you wish to evaluate. Two API formats are currently supported:

  • OPENAI: the OpenAI ChatCompletion API endpoint
  • HUGGINGFACE: the Hugging Face Hosted Inference API

If you are hosting models locally, you can use a package like LocalAI to expose an OpenAI-compatible REST API that Judy can use.
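
As a rough sketch of what "OpenAI-compatible" means here, a ChatCompletion endpoint (such as one exposed by LocalAI) accepts POST requests shaped like the one below. The helper name, its defaults, and the local URL are illustrative assumptions, not part of Judy's API:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "") -> urllib.request.Request:
    """Build a request for an OpenAI-compatible ChatCompletion endpoint.

    `base_url` could point at a local LocalAI server, e.g. "http://localhost:8080/v1".
    (Illustrative helper -- not Judy's internal code.)
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request("http://localhost:8080/v1", "gpt-4", "hello", api_key="sk-test")
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) returns a JSON body whose generated text lives under the `choices` key in the OpenAI response format.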

Judy Commands

A CLI is provided for viewing and editing Judy config files:

judy config

Run an evaluation as follows:

judy run --run-config run_config.yml --name disinfo-test --output ./results

After running an evaluation, you can serve a web app for viewing the results:

judy serve -r ./results

Web App Screenshots

The web app allows you to view your evaluation results.

[Screenshots: Overview, App Runs, Raw Results]

Roadmap

Features

  • Core framework
  • Web app - to view evaluation results
  • Add perturbations: the ability to modify input datasets with typos, synonyms, etc.
  • Add adaptations: the ability to use different prompting techniques, such as Chain of Thought.

Scenarios

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate. Check out the contribution guide for more details.

Citation - BibTeX

@software{Hutchinson_Judy_-_LLM_2024,
  author = {Hutchinson, Linden and Raghavan, Rahul},
  month = feb,
  title = {{Judy - LLM Evaluator}},
  url = {https://github.com/TNT-Hoopsnake/judy},
  version = {2.0.0},
  year = {2024}
}
