Judy
Judy is a Python library and framework for evaluating the text-generation capabilities of Large Language Models (LLMs) using a Judge LLM.
Judy allows users to evaluate LLMs using a competent Judge LLM (such as GPT-4). Users can choose from a set of predefined scenarios sourced from recent research, or design their own. A scenario is a specific test designed to evaluate a particular aspect of an LLM, and it consists of:
- Dataset: A source dataset used to generate prompts to evaluate models against.
- Task: A task to evaluate models on. Tasks for judge evaluations have been carefully designed by researchers to assess certain aspects of LLMs.
- Metric: The metric(s) to use when evaluating the responses from a task. For example: accuracy, level of detail, etc.
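To make the shape of a scenario concrete, here is a minimal sketch of one expressed as a plain Python data structure. The class and field names are illustrative assumptions, not Judy's actual API.

```python
# Hypothetical sketch of a scenario; names are illustrative, not Judy's API.
from dataclasses import dataclass, field

@dataclass
class Scenario:
    dataset: str                                       # source dataset used to generate prompts
    task: str                                          # what the model is asked to do
    metrics: list[str] = field(default_factory=list)   # qualities the judge scores

# A scenario probing summarisation quality on a disinformation dataset
disinfo = Scenario(
    dataset="disinfo-wedging",
    task="summarisation",
    metrics=["accuracy", "level of detail"],
)
```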
Judy has been inspired by techniques used in research including HELM [1] and LLM-as-a-judge [2].
- [1] Holistic Evaluation of Language Models - https://arxiv.org/abs/2211.09110
- [2] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - https://arxiv.org/abs/2306.05685
Installation
Use the package manager pip to install Judy. Note: Judy requires Python >= 3.10.
pip install judyeval
Alternate Installation
You can also install Judy directly from this git repo:
pip install git+https://github.com/TNT-Hoopsnake/judy
Getting Started
Setup configs
Judy uses three configuration files during evaluation. Only the run config is strictly necessary to begin with:
- Dataset Config: Defines all of the datasets available to use in an evaluation run, how to download them, and which class to use to format them. You don't need to specify this config unless you plan on adding new datasets; Judy will automatically use its example dataset config unless you specify an alternate one using --dataset-config.
- Evaluation Config: Defines all of the tasks and the metrics used to evaluate them. It also restricts which datasets and metrics can be used for each task. You don't need to specify this config unless you plan on adding new tasks or metrics; Judy will automatically use its example eval config unless you specify an alternate one using --eval-config.
- Run Config: Defines all of the settings to use for your evaluation run. The evaluation results for your run store a copy of these settings (with sensitive details redacted) as metadata. An example run config is provided in the repository.
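Since the configs are plain YAML, they are easy to load and inspect programmatically. The sketch below, assuming PyYAML is installed and using made-up field names rather than Judy's actual schema, loads a run config and blanks out credential-like fields, mirroring the redaction Judy applies to stored run metadata.

```python
# Minimal sketch: load a run config and redact secret-looking fields before
# logging. Field names here are assumptions, not Judy's documented schema.
import yaml

with open("run_config.yml") as f:
    config = yaml.safe_load(f)

def redact(value, key=""):
    """Recursively blank out any field whose name looks like a credential."""
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if any(word in key.lower() for word in ("key", "token", "secret")):
        return "<redacted>"
    return value

print(redact(config))
```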
Setup model(s) to evaluate
Ensure you have API access to the models you wish to evaluate. We currently support two API formats:
- OPENAI: The OpenAI API ChatCompletion endpoint
- HUGGINGFACE: The HuggingFace Hosted Inference API

If you are hosting models locally, you can use a package like LocalAI to expose an OpenAI-compatible REST API that Judy can use.
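Before pointing Judy at a locally hosted model, it can be useful to sanity-check the endpoint directly. Below is a minimal sketch using the openai Python client; the base URL and model name are assumptions that depend on how your local server is configured.

```python
# Sanity-check a locally hosted, OpenAI-compatible endpoint (e.g. LocalAI).
# The base_url and model name are assumptions; adjust them to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="my-local-model",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```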
Judy Commands
A CLI is provided for viewing and editing Judy config files:
judy config
Run an evaluation as follows:
judy run --run-config run_config.yml --name disinfo-test --output ./results
After running an evaluation, you can serve a web app for viewing the results:
judy serve -r ./results
Web App Screenshots
The web app allows you to view your evaluation results.
Roadmap
Features
- Core framework
- Web app - to view evaluation results
- Add perturbations - the ability to modify input datasets with typos, synonyms etc. (a sketch of the idea follows this list)
- Add adaptations - the ability to use different prompting techniques - such as Chain of Thought etc.
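Perturbations are not implemented yet, but the idea can be sketched simply if we assume a perturbation is just a function from prompt text to prompt text. The function below is purely illustrative and is not part of Judy.

```python
# Hypothetical sketch of a typo perturbation: randomly swap adjacent
# characters in a prompt. Illustrates the roadmap idea only; not Judy's API.
import random

def typo_perturbation(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(typo_perturbation("Summarise the following article in two sentences."))
```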
Scenarios
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate. Check out the contribution guide for more details.
Citation - BibTeX
@software{Hutchinson_Judy_-_LLM_2024,
  author = {Hutchinson, Linden and Raghavan, Rahul},
  month = feb,
  title = {{Judy - LLM Evaluator}},
  url = {https://github.com/TNT-Hoopsnake/judy},
  version = {2.0.0},
  year = {2024}
}