Judy

Judy is a Python library and framework for evaluating the text-generation capabilities of Large Language Models (LLMs) using a judge LLM.

Judy allows users to evaluate LLMs using a competent judge LLM (such as GPT-4). Users can choose from a set of predefined scenarios sourced from recent research, or design their own. A scenario is a specific test designed to evaluate a particular aspect of an LLM. A scenario consists of:

  • Dataset: A source dataset from which prompts are generated to evaluate models.
  • Task: The task models are evaluated on. Tasks for judge evaluations have been carefully designed by researchers to assess specific aspects of LLMs.
  • Metric: The metric(s) used to evaluate the responses to a task, for example accuracy or level of detail.
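
As an illustration only (this is not Judy's actual API), a scenario bundling these three pieces could be modeled like this; the class and field names here are assumptions made for the sketch:

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Illustrative model of a scenario: a dataset, a task, and metrics.

    This mirrors the concepts above; it is not Judy's internal representation.
    """

    dataset: str  # source dataset used to generate prompts
    task: str  # what the models are asked to do
    metrics: list = field(default_factory=list)  # e.g. ["accuracy", "level of detail"]


example = Scenario(
    dataset="example-dataset",
    task="summarization",
    metrics=["accuracy", "level of detail"],
)
```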

Framework Overview

Judy is inspired by techniques from recent research, including HELM [1] and LLM-as-a-judge [2].


Installation

Use the package manager pip to install Judy. Note: Judy requires Python >= 3.10.

pip install judyeval

Alternate Installation

You can also install Judy directly from this git repo:

pip install git+https://github.com/TNT-Hoopsnake/judy

Getting Started

Setup configs

Judy uses three configuration files during evaluation. Only the run config is strictly required to get started:

  • Dataset Config: Defines every dataset available to an evaluation run, how to download it, and which class is used to format it. You only need to specify this config if you plan to add new datasets; otherwise Judy automatically uses the bundled example dataset config, unless you pass an alternative via --dataset-config.
  • Evaluation Config: Defines all of the tasks and the metrics used to evaluate them, and restricts which datasets and metrics each task may use. You only need to specify this config if you plan to add new tasks or metrics; otherwise Judy automatically uses the bundled example eval config, unless you pass an alternative via --eval-config.
  • Run Config: Defines all of the settings for your evaluation run. The evaluation results for your run store a copy of these settings (with sensitive details redacted) as metadata. An example run config is provided in the repository.
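
A run config is a YAML file. The sketch below is a loose illustration only; the key names are assumptions, not Judy's actual schema, so consult the bundled example run config for the real field names:

```yaml
# Hypothetical run config sketch -- real key names may differ.
name: disinfo-test
judge:
  api_type: OPENAI            # judge model served over the OpenAI API format
  model: gpt-4
models:
  - name: my-local-model
    api_type: OPENAI          # e.g. a LocalAI endpoint
    api_base: http://localhost:8080/v1
    api_key: ${MODEL_API_KEY} # sensitive values are redacted in stored run metadata
```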

Setup model(s) to evaluate

Ensure you have API access to the models you wish to evaluate. Two API formats are currently supported:

  • OPENAI: the OpenAI ChatCompletion API endpoint
  • HUGGINGFACE: the Hugging Face Hosted Inference API

If you are hosting models locally, you can use a package like LocalAI to expose an OpenAI-compatible REST API that Judy can use.
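
As a rough sketch of what "OpenAI-compatible" means here, a ChatCompletion endpoint (such as one exposed by LocalAI) accepts POST requests shaped like the one below. The helper name, its defaults, and the local URL are illustrative assumptions, not part of Judy's API:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "") -> urllib.request.Request:
    """Build a request for an OpenAI-compatible ChatCompletion endpoint.

    `base_url` could point at a local LocalAI server, e.g. "http://localhost:8080/v1".
    (Illustrative helper -- not Judy's internal code.)
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request("http://localhost:8080/v1", "gpt-4", "hello", api_key="sk-test")
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) returns a JSON body whose generated text lives under the `choices` key in the OpenAI response format.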

Judy Commands

A CLI is provided for viewing and editing Judy config files:

judy config

Run an evaluation as follows:

judy run --run-config run_config.yml --name disinfo-test --output ./results

After running an evaluation, you can serve a web app for viewing the results:

judy serve -r ./results

Web App Screenshots

The web app allows you to view your evaluation results.

[Screenshots: Overview, App Runs, Raw Results]

Roadmap

Features

  • Core framework
  • Web app - to view evaluation results
  • Add perturbations: the ability to modify input datasets with typos, synonyms, etc.
  • Add adaptations: the ability to use different prompting techniques, such as Chain of Thought.

Scenarios

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate. Check out the contribution guide for more details.

Citation - BibTeX

@software{Hutchinson_Judy_-_LLM_2024,
  author = {Hutchinson, Linden and Raghavan, Rahul},
  month = feb,
  title = {{Judy - LLM Evaluator}},
  url = {https://github.com/TNT-Hoopsnake/judy},
  version = {2.0.0},
  year = {2024}
}
