JETTS Benchmark

JETTS: Judge Evaluation for Test-Time-Scaling

Authors: Yilun Zhou*, Austin Xu*, Peifeng Wang, Caiming Xiong, Shafiq Joty

This repository contains the source code for the JETTS benchmark, introduced in the paper Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators.

Setup

Install JETTS

We recommend installing this package in a fresh conda environment, as follows:

conda create -n jetts python=3.12 -y
conda activate jetts
pip install uv
git clone https://github.com/SalesforceAIResearch/jetts-benchmark
cd jetts-benchmark
uv pip install -e .

Alternatively, if you do not need to modify any source code, you can directly run pip install uv; uv pip install jetts in the command line after creating the conda environment.

Download Model Response Data

We have generated and evaluated responses from a set of generator models on our benchmark tasks so that, except for critique-based refinement, you do not need to run any generator model. These data are stored on Google Cloud and are freely accessible to anyone, but you need the gcloud command-line tool to download them. Follow the official Google Cloud CLI instructions to download and install it.

If you are working in this folder, we recommend creating a subfolder for these data files.

mkdir jetts_data
cd jetts_data

# data for reranking and refinement (143MB zipped, 650MB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/reranking_and_refinement.tar.gz .
tar xzf reranking_and_refinement.tar.gz
rm reranking_and_refinement.tar.gz

# data for beam search (6.7GB zipped, 51GB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/beam_search.tar.gz .
# this can take a while; to see a progress bar, install "pv" and run "pv beam_search.tar.gz | tar xz" instead
tar xzf beam_search.tar.gz
rm beam_search.tar.gz

If everything works correctly, you should see the following folders being created and populated:

jetts_data
├─ beam_search
│  └─ (more subfolders)
└─ reranking_and_refinement
   └─ (jsonl files)
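
As a quick sanity check on the layout above, a small Python sketch along these lines can report which of the two folders is absent or empty. The function name missing_subfolders is hypothetical and not part of the jetts package; the folder names are the ones shown in the tree.

```python
from pathlib import Path

# Folder names taken from the tree above.
EXPECTED_SUBFOLDERS = ("beam_search", "reranking_and_refinement")


def missing_subfolders(data_root):
    """Return the expected subfolders that are absent or empty under data_root."""
    root = Path(data_root)
    missing = []
    for name in EXPECTED_SUBFOLDERS:
        folder = root / name
        # A folder counts as populated if it holds at least one entry.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

Calling missing_subfolders("jetts_data") returns an empty list when both folders are populated. If you downloaded only one of the two archives, the other folder is expected to be listed.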

Note: we are working on uploading the data files to Huggingface so that they can be downloaded automatically on the fly. Please stay tuned!

In reranking_and_refinement, each file is named {dataset}_{generator_model}.jsonl and holds the responses (up to 10) generated by one model for one dataset.
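
To get a first look at one of these files without assuming anything about its schema, a sketch like the following may help. It relies only on the .jsonl convention of one JSON object per line; the helper names read_jsonl and summarize are hypothetical, and the field names are whatever the file actually contains.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON value per non-empty line of the file."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(path):
    """Return (number of records, sorted top-level keys of the first record)."""
    records = read_jsonl(path)
    first = records[0] if records else {}
    keys = sorted(first) if isinstance(first, dict) else []
    return len(records), keys
```

Running summarize on any file in reranking_and_refinement shows how many records it holds and which top-level fields to expect.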

In beam_search, each subfolder is named {dataset}_{N}_{M}_{d}_{generator_model} and contains the fully expanded beam search trees generated by one model for one dataset, with files 0.jsonl to {L-1}.jsonl corresponding to the L queries in the dataset. N, M, and d are the number of initial step samples, the beam width, and the maximum depth of the search tree, respectively, as detailed in Sec. 3.3 of the paper.
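
The naming scheme above can be unpacked programmatically. The sketch below assumes that N, M, and d appear as plain integers and that the dataset name itself does not contain three consecutive underscore-separated integers. Both are assumptions about the naming, and the example folder name in the usage note is hypothetical.

```python
import re
from pathlib import Path

# {dataset}_{N}_{M}_{d}_{generator_model}, with N, M, d assumed to be integers.
FOLDER_PATTERN = re.compile(
    r"^(?P<dataset>.+?)_(?P<N>\d+)_(?P<M>\d+)_(?P<d>\d+)_(?P<generator_model>.+)$"
)


def parse_folder_name(name):
    """Split a beam_search subfolder name into its five components."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    fields = match.groupdict()
    for key in ("N", "M", "d"):
        fields[key] = int(fields[key])
    return fields


def tree_files(folder):
    """List 0.jsonl .. {L-1}.jsonl in numeric rather than lexicographic order."""
    files = [p for p in Path(folder).glob("*.jsonl") if p.stem.isdigit()]
    return sorted(files, key=lambda p: int(p.stem))
```

For a hypothetical folder named gsm8k_10_5_20_llama8b, parse_folder_name yields dataset "gsm8k", N 10, M 5, d 20, and generator_model "llama8b". The numeric sort in tree_files avoids placing 10.jsonl before 2.jsonl.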

Launch a Judge Model

To benchmark a specific judge, JETTS supports two methods of defining a judge model instance:

  1. An OpenAI-compatible server (e.g., OpenAI models, Together AI models, and model servers launched by vllm serve).
  2. A vllm.LLM object, which is wrapped in jetts.judge.vllm_judge.VllmJudge.

In the demos below, we use the first method, with vllm serve launching the model server. We provide a helper script to make this process easy:

python scripts/launch_judge.py --judge-model [JUDGE]

where [JUDGE] is one of the short or full names in the table below, or the huggingface model ID for any vllm-supported model.

Short Name   Full Name
prom7b       prometheus-eval/prometheus-7b-v2.0
sc8b         Skywork/Skywork-Critic-Llama-3.1-8B
ob8b         NCSOFT/Llama-3-OffsetBias-8B
thm8b        PKU-ONELab/Themis
prom8x7b     prometheus-eval/prometheus-8x7b-v2.0
sc70b        Skywork/Skywork-Critic-Llama-3.1-70B
ste70b       facebook/Self-taught-evaluator-llama3.1-70B
llama8b      meta-llama/Llama-3.1-8B-Instruct

The judge will be served on localhost:8000, as expected by the script for each task.
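
Before starting a task, it can be useful to confirm that the judge server is reachable. The sketch below queries the /v1/models endpoint that OpenAI-compatible servers expose, using only the Python standard library. The function name served_models is hypothetical, and the sketch assumes the server does not require an API key.

```python
import json
import urllib.request


def served_models(base_url="http://localhost:8000", timeout=5.0):
    """Return the model IDs listed by the server, or None if it cannot be queried."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON reply.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("id") for m in payload.get("data", []) if isinstance(m, dict)]
```

A return value of None suggests that the server is still loading the model or is not running. The same check can be pointed at the generator server by passing a different base_url.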

Note that, due to company policy, we are not able to release weights for the SFR-Judge family of models or provide API access. We hope to do so in the future and will update these instructions accordingly.

Running JETTS Tasks

Response Reranking

With reranking_and_refinement data downloaded and judge launched, reranking can be run with

python scripts/reranking.py --data-file [DATA_FILE]

where [DATA_FILE] is the path to one of the .jsonl files in the reranking_and_refinement data folder. The script automatically computes the performance at the end and writes a file containing the ranked responses for the dataset. The file is named {judge_model}_{data_file_name}_{reranking_method}.jsonl and is placed in the folder specified by --output-dir, which defaults to outputs/reranking. Please consult the script or run it with the -h flag for optional arguments to customize the run.
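
To rerank every downloaded file in one go, the command above can be wrapped in a small loop. The sketch below uses only the documented --data-file flag; the helper names are hypothetical, and it assumes you run it from the repository root with the judge already being served.

```python
import subprocess
import sys
from pathlib import Path


def reranking_commands(data_dir, script="scripts/reranking.py"):
    """Build one command per .jsonl file in data_dir, in sorted order."""
    return [
        [sys.executable, script, "--data-file", str(path)]
        for path in sorted(Path(data_dir).glob("*.jsonl"))
    ]


def run_all(data_dir):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in reranking_commands(data_dir):
        subprocess.run(cmd, check=True)
```

For example, run_all("jetts_data/reranking_and_refinement") processes the files sequentially, which keeps the load on a single judge server predictable.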

Step-Level Beam Search

With beam_search data downloaded and judge launched, beam search can be run with

python scripts/beam_search.py --input-dir [INPUT_DIR]

where [INPUT_DIR] is the path to one of the subfolders inside the beam_search data folder. The script automatically computes the performance at the end and creates a folder with files 0.jsonl to {L-1}.jsonl containing the beam search decisions for each tree. The result folder is named {judge_model}_{input_dir_name}_{beam_selection_reranking_method}_{final_selection_reranking_method}.jsonl and is placed in the folder specified by --output-dir, which defaults to outputs/beam_search. Please consult the script or run it with the -h flag for optional arguments to customize the run.

Critique-Based Refinement

In addition to downloading the reranking_and_refinement data and launching the judge, you also need to launch the generator, because refinement requires live response generation following the judge critiques. The generator can be launched in the same way as the judge, with

python scripts/launch_generator.py --generator-name [GENERATOR]

where [GENERATOR] is one of the short or full names in the table below, or the huggingface model ID for any vllm-supported model.

Short Name   Full Name
llama8b      meta-llama/Llama-3.1-8B-Instruct
llama70b     meta-llama/Llama-3.1-70B-Instruct
qwen32b      Qwen/Qwen2.5-32B-Instruct
qwen72b      Qwen/Qwen2.5-72B-Instruct

The generator will be served on localhost:8001, as expected by the refinement script, which can be executed as

python scripts/refinement.py --data-file [DATA_FILE]

where [DATA_FILE] is the path to one of the .jsonl files in the reranking_and_refinement data folder. Unlike the other two scripts, this one does not compute the performance at the end, but it does write a file containing all refined responses in the final reranking order for the dataset. The file is named {judge_model}_{refiner_model}_{data_file_name}_{final_reranking_method}.jsonl and is placed in the folder specified by --output-dir, which defaults to outputs/refinement. Please consult the script or run it with the -h flag for optional arguments to customize the run.

Evaluating Refinement Result

Note: you do not need to run manual evaluation for reranking and beam search as we have pre-computed the score for all model responses (including every leaf node in the search tree) and saved them with the data files. You only need to run this evaluation for refinement as the responses are freshly generated by the generator model.

To avoid interfering with the packages needed to run the benchmark itself, we strongly recommend running the evaluation in a dedicated conda environment. Furthermore, since BigCodeBench requires relatively old versions of many packages (e.g., numpy==1.21.2, released in August 2021), its evaluation interferes with that of other datasets. Thus, we recommend following the steps below to create two environments.

To run evaluations for CHAMP and AlpacaEval, you need to have OPENAI_API_KEY saved as an environment variable (e.g., export OPENAI_API_KEY=[YOUR_API_KEY]). For CHAMP, you also need to keep the generator server (i.e., scripts/launch_generator.py) running at port 8001 during evaluation; the judge server can be terminated.
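
The requirements in the paragraph above can be captured in a small pre-flight check. This is only an illustrative sketch: the function name evaluation_reminders and the lowercase dataset keys are hypothetical, and nothing here is part of the jetts package.

```python
import os

# Requirements as stated above; the lowercase keys are an assumption.
NEEDS_OPENAI_KEY = {"champ", "alpacaeval"}
NEEDS_GENERATOR_SERVER = {"champ"}


def evaluation_reminders(dataset, env=None):
    """Return human-readable reminders for evaluating the given dataset."""
    env = os.environ if env is None else env
    reminders = []
    if dataset in NEEDS_OPENAI_KEY and not env.get("OPENAI_API_KEY"):
        reminders.append("OPENAI_API_KEY is not set")
    if dataset in NEEDS_GENERATOR_SERVER:
        reminders.append("keep the generator server running on port 8001")
    return reminders
```

An empty list means that neither requirement applies to the dataset or that the key is already set.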

Note: all evaluations need to be run inside the jetts_eval directory.

Everything except for BigCodeBench

cd jetts_eval
conda create -n jetts-eval python=3.10 -y
conda activate jetts-eval
pip install uv

# You only need to run the line(s) below for the dataset(s) that you wish to evaluate
uv pip install "math-verify[antlr4_13_2]"  # for GSM8k and MATH (quoted so that shells such as zsh do not expand the brackets)
uv pip install champ_dataset datasets openai  # for CHAMP
uv pip install alpaca-eval  # for AlpacaEval
uv pip install numpy absl-py langdetect nltk immutabledict  # for IFEval

# run all lines below for HumanEval+ and MBPP+
cd local_code_eval
tar xzf evalplus.tar.gz
cd evalplus
uv pip install -e .
cd ../..  # return to jetts_eval, where all evaluations need to be run

BigCodeBench

conda create -n jetts-eval-bcb python=3.10 -y
conda activate jetts-eval-bcb
pip install uv
cd local_code_eval
tar xzf bigcodebench.tar.gz
cd bigcodebench
uv pip install -e .
uv pip install -r requirements-eval.txt

Running the evaluation

After you have installed the necessary packages and activated the correct environment, you can run

python evaluate_refinement.py --refinement-output-file [REFINEMENT_OUTPUT_FILE]

where [REFINEMENT_OUTPUT_FILE] is the path to the output file generated by scripts/refinement.py. By default, the file name contains the dataset name. However, if you provide a custom output file whose name does not contain the dataset name, you need to additionally specify --dataset [DATASET], where [DATASET] is one of gsm8k, math, champ, humaneval, mbpp, bigcodebench, alpacaeval, or ifeval. The program prints the score at the end.
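
To decide whether --dataset is needed for a given file, you can check whether its name contains one of the dataset names listed above. The sketch below is a hypothetical illustration using a case-insensitive substring match. It is not necessarily how evaluate_refinement.py detects the dataset, so consult the script for the actual logic.

```python
from pathlib import Path

# Dataset names accepted by --dataset, as listed above.
DATASETS = [
    "gsm8k", "math", "champ", "humaneval",
    "mbpp", "bigcodebench", "alpacaeval", "ifeval",
]


def infer_dataset(output_file):
    """Return the single dataset name found in the file name, or None."""
    name = Path(output_file).name.lower()
    matches = [d for d in DATASETS if d in name]
    # Stay conservative: no match or an ambiguous match means "specify --dataset".
    return matches[0] if len(matches) == 1 else None
```

If infer_dataset returns None for your file, passing --dataset explicitly is the safe choice.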

Questions?

If you have any questions, feel free to open an issue or contact the authors at yilun.zhou@salesforce.com and austin.xu@salesforce.com.
