Judy

Judy is a Python library and framework for evaluating the text-generation capabilities of Large Language Models (LLMs) using a Judge LLM.

Judy allows users to evaluate LLMs using a competent Judge LLM (such as GPT-4). Users can choose from a set of predefined scenarios sourced from recent research, or design their own. A scenario is a specific test designed to evaluate a particular aspect of an LLM, and it consists of:

  • Dataset: A source dataset from which prompts are generated to evaluate models against.
  • Task: A task to evaluate models on. Tasks for judge evaluations have been carefully designed by researchers to assess specific aspects of LLMs.
  • Metric: The metric(s) to use when evaluating the responses to a task, for example accuracy or level of detail.

Framework Overview

Judy is inspired by techniques used in recent research, including HELM [1] and LLM-as-a-judge [2].


Installation

Use the package manager pip to install Judy. Note: Judy requires Python >= 3.10.

pip install judyeval

Alternate Installation

You can also install Judy directly from this git repo:

pip install git+https://github.com/TNT-Hoopsnake/judy

Getting Started

Set up configs

Judy uses three configuration files during evaluation. Only the run config is strictly necessary to begin with:

  • Dataset Config: Defines all of the datasets available for an evaluation run, how to download them, and which class to use to format them. You don't need to specify this config unless you plan on adding new datasets; Judy automatically uses the bundled example dataset config unless you point it at an alternate one using --dataset-config.
  • Evaluation Config: Defines all of the tasks and the metrics used to evaluate them, and restricts which datasets and metrics can be used for each task. You don't need to specify this config unless you plan on adding new tasks or metrics; Judy automatically uses the bundled example eval config unless you point it at an alternate one using --eval-config.
  • Run Config: Defines all of the settings for your evaluation run. The evaluation results for your run store a copy of these settings (with sensitive details redacted) as metadata. An example run config ships with the project; an illustrative sketch follows this list.
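
To give a feel for its shape, here is a purely illustrative run config sketch. Every field name below is an assumption made for this example, not Judy's documented schema; consult the example run config shipped with the project for the real keys.

# Hypothetical run config sketch. All keys below are illustrative
# assumptions, NOT Judy's actual schema -- see the bundled example
# run config for the real structure.
judge:
  model: gpt-4            # the Judge LLM used to score responses
models:
  - name: my-model        # a model under evaluation
    api_type: OPENAI      # one of the supported API formats (see below)
scenarios:
  - disinfo               # predefined scenario(s) to run

A file like this is then passed to the CLI via judy run --run-config run_config.yml.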

Set up model(s) to evaluate

Ensure you have API access to the models you wish to evaluate. We currently support two API formats:

  • OPENAI: The OpenAI API Chat Completions endpoint
  • HUGGINGFACE: The Hugging Face Hosted Inference API

If you are hosting models locally, you can use a package like LocalAI to expose an OpenAI-compatible REST API that Judy can consume.
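
Before configuring Judy to use a locally hosted endpoint, it can help to confirm that the endpoint actually speaks the OpenAI Chat Completions protocol. Below is a minimal sanity check using the official openai Python client; the base URL, API key, and model name are assumptions for a default LocalAI deployment, not values Judy prescribes.

# Sanity-check an OpenAI-compatible endpoint (e.g. LocalAI) before
# pointing Judy at it. The base URL, key, and model name are
# assumptions for a default local deployment -- adjust as needed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical LocalAI address
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # whichever model name your local server exposes
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(response.choices[0].message.content)

If this returns a sensible completion, the same endpoint should be usable from a Judy run config.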

Judy Commands

A CLI is provided for viewing and editing Judy config files:

judy config

Run an evaluation as follows:

judy run --run-config run_config.yml --name disinfo-test --output ./results

After running an evaluation, you can serve a web app for viewing the results:

judy serve -r ./results

Web App Screenshots

The web app allows you to view your evaluation results.

Screenshots (omitted here) show an overview of app runs and a raw results view.

Roadmap

Features

  • Core framework
  • Web app - to view evaluation results
  • Add perturbations - the ability to modify input datasets with typos, synonyms, etc.
  • Add adaptations - the ability to use different prompting techniques, such as Chain of Thought.

Scenarios

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate. Check out the contribution guide for more details.

Citation - BibTeX

@software{Hutchinson_Judy_-_LLM_2024,
  author = {Hutchinson, Linden and Raghavan, Rahul},
  month = feb,
  title = {{Judy - LLM Evaluator}},
  url = {https://github.com/TNT-Hoopsnake/judy},
  version = {2.0.0},
  year = {2024}
}
