JETTS Benchmark

JETTS: Judge Evaluation for Test-Time-Scaling

Authors: Yilun Zhou*, Austin Xu*, Peifeng Wang, Caiming Xiong, Shafiq Joty

This repository contains the source code for the JETTS benchmark, introduced in the paper Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators.

Setup

Install JETTS

We recommend installing this package in a fresh conda environment, as follows:

conda create -n jetts python=3.12 -y
conda activate jetts
pip install uv
git clone https://github.com/SalesforceAIResearch/jetts-benchmark
cd jetts-benchmark
uv pip install -e .

Alternatively, if you do not need to modify the source code, you can install the released package directly by running pip install uv; uv pip install jetts on the command line after creating the conda environment.

Download Model Response Data

We have pre-generated and evaluated responses from a set of generator models on our benchmark tasks so that, except for critique-based refinement, you do not need to run any generator models. These data are stored on Google Cloud and are freely accessible to anyone, but you need the gcloud command-line tool to download them. Follow Google's official installation instructions for the gcloud CLI to download and install it.

If you are working in this folder, we recommend creating a subfolder for these data files.

mkdir jetts_data
cd jetts_data

# data for reranking and refinement (143MB zipped, 650MB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/reranking_and_refinement.tar.gz .
tar xzf reranking_and_refinement.tar.gz
rm reranking_and_refinement.tar.gz

# data for beam search (6.7GB zipped, 51GB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/beam_search.tar.gz .
# this can take a while; to see a progress bar, install "pv" and use "pv beam_search.tar.gz | tar xz" instead
tar xzf beam_search.tar.gz
rm beam_search.tar.gz

If everything works correctly, you should see the following folders being created and populated:

jetts_data
├─ beam_search
│  └─ (more subfolders)
└─ reranking_and_refinement
   └─ (jsonl files)
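
To confirm the extraction programmatically, a minimal sketch such as the following can count what ended up in each folder. The function name summarize_layout is hypothetical and not part of the jetts package; the sketch only relies on the folder names shown in the tree above.

```python
from pathlib import Path


def summarize_layout(root: str) -> dict:
    """Count the items expected under the data root after extraction."""
    base = Path(root)
    beam_dir = base / "beam_search"
    rerank_dir = base / "reranking_and_refinement"
    # Subfolders of beam_search (0 if the folder is missing).
    n_beam = sum(1 for p in beam_dir.iterdir() if p.is_dir()) if beam_dir.is_dir() else 0
    # .jsonl files directly inside reranking_and_refinement (0 if missing).
    n_rerank = len(list(rerank_dir.glob("*.jsonl"))) if rerank_dir.is_dir() else 0
    return {"beam_search_subfolders": n_beam, "reranking_jsonl_files": n_rerank}
```

Calling summarize_layout("jetts_data") from the parent folder should report non-zero counts for whichever archives you downloaded.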

Note: we are working on uploading the data files to Hugging Face so that they can be downloaded automatically on the fly. Please stay tuned!

In reranking_and_refinement, each file is named {dataset}_{generator_model}.jsonl. It stores the responses generated by one model for one dataset and contains up to 10 responses.
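
As an illustration of the naming scheme, the sketch below splits a file name into its two parts and loads the records. It assumes that the dataset part contains no underscore and that each line is a JSON object. The record schema is not documented here, so none is assumed, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def split_data_file_name(path: str) -> tuple:
    """Split '{dataset}_{generator_model}.jsonl' at the first underscore.

    Assumption: the dataset part itself contains no underscore.
    """
    dataset, _, generator = Path(path).stem.partition("_")
    return dataset, generator


def read_jsonl(path: str) -> list:
    """Load one JSON value per non-empty line; no record schema is assumed."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Inspecting the keys of the first record returned by read_jsonl is a quick way to see which fields a data file actually provides.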

In beam_search, each subfolder contains the fully expanded beam search trees generated by a model for a particular dataset, named {dataset}_{N}_{M}_{d}_{generator_model}, with 0.jsonl to {L-1}.jsonl corresponding to the L queries in the dataset. N, M and d correspond to the number of initial step samples, beam width and max depth of the search tree, as detailed in Sec. 3.3 of the paper.
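
The subfolder naming can be unpacked with a small parser like the one below. This is a sketch under the assumption that the dataset name contains no underscore; BeamSearchRun and parse_run_name are hypothetical names, and the example values in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeamSearchRun:
    dataset: str
    n_initial_samples: int  # N: number of initial step samples
    beam_width: int         # M: beam width
    max_depth: int          # d: max depth of the search tree
    generator_model: str


def parse_run_name(name: str) -> BeamSearchRun:
    # Split into at most five parts so that any underscores inside the
    # generator name (the last field) are preserved.
    dataset, n, m, d, generator = name.split("_", 4)
    return BeamSearchRun(dataset, int(n), int(m), int(d), generator)
```

For a hypothetical folder called gsm8k_8_4_10_llama8b, parse_run_name yields dataset gsm8k with N=8, M=4, d=10 and generator llama8b.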

Launch a Judge Model

To benchmark a specific judge, JETTS supports two methods of defining a judge model instance:

An OpenAI-compatible server (e.g., OpenAI models, Together AI models, and model servers launched by vllm serve).
A vllm.LLM object, which is wrapped in jetts.judge.vllm_judge.VllmJudge.

In the demos below, we follow the first method and use vllm serve to launch a model server. We provide a helper script to make this process easy:

python scripts/launch_judge.py --judge-model [JUDGE]

where [JUDGE] is one of the short or full names in the table below, or the Hugging Face model ID of any vllm-supported model.

Short Name	Full Name
prom7b	prometheus-eval/prometheus-7b-v2.0
sc8b	Skywork/Skywork-Critic-Llama-3.1-8B
ob8b	NCSOFT/Llama-3-OffsetBias-8B
thm8b	PKU-ONELab/Themis
prom8x7b	prometheus-eval/prometheus-8x7b-v2.0
sc70b	Skywork/Skywork-Critic-Llama-3.1-70B
ste70b	facebook/Self-taught-evaluator-llama3.1-70B
llama8b	meta-llama/Llama-3.1-8B-Instruct
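
The short-name lookup described above can be pictured as a dictionary with a pass-through for any other model ID. This sketch is illustrative only and is not taken from scripts/launch_judge.py; the names JUDGE_SHORT_NAMES and resolve_judge are hypothetical.

```python
# Short names and full model IDs copied from the table above.
JUDGE_SHORT_NAMES = {
    "prom7b": "prometheus-eval/prometheus-7b-v2.0",
    "sc8b": "Skywork/Skywork-Critic-Llama-3.1-8B",
    "ob8b": "NCSOFT/Llama-3-OffsetBias-8B",
    "thm8b": "PKU-ONELab/Themis",
    "prom8x7b": "prometheus-eval/prometheus-8x7b-v2.0",
    "sc70b": "Skywork/Skywork-Critic-Llama-3.1-70B",
    "ste70b": "facebook/Self-taught-evaluator-llama3.1-70B",
    "llama8b": "meta-llama/Llama-3.1-8B-Instruct",
}


def resolve_judge(name: str) -> str:
    """Map a short name to its full model ID; pass anything else through."""
    return JUDGE_SHORT_NAMES.get(name, name)
```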

The judge will be served on localhost:8000, as expected by the script for each task.
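
Before starting a task, it can help to confirm that the server is reachable. The sketch below queries the /v1/models route that OpenAI-compatible servers expose, using only the standard library. The helper names are hypothetical, and the host and port defaults simply mirror the address stated above.

```python
import json
import urllib.request


def models_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the model-listing URL of an OpenAI-compatible server."""
    return f"http://{host}:{port}/v1/models"


def list_served_models(host: str = "localhost", port: int = 8000,
                       timeout: float = 5.0) -> list:
    """Return the IDs of the models served at host:port.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload["data"]]
```

If list_served_models() raises a connection error, the judge server is most likely still loading or was launched on a different port.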

Note that due to company policy, we are not able to release weights for the SFR-Judge family of models or provide API access. We hope to do so in the future and will update these instructions accordingly.

Running JETTS Tasks

Response Reranking

With the reranking_and_refinement data downloaded and the judge launched, reranking can be run with

python scripts/reranking.py --data-file [DATA_FILE]

where [DATA_FILE] is the path to one of the .jsonl files in the reranking_and_refinement data folder. The script automatically computes the performance at the end and writes a file containing the ranked responses for the dataset. The file is named {judge_model}_{data_file_name}_{reranking_method}.jsonl and is placed inside the folder specified by --output-dir, which defaults to the outputs/reranking folder. Please consult the script or run it with the -h flag for optional arguments to customize the run.
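
To rerank every data file in one go, the invocation above can be generated per file. The following is a minimal sketch that only uses the documented --data-file flag; the function name reranking_commands is hypothetical, and the commands are meant to be run from the repository root while the judge server is up.

```python
import sys
from pathlib import Path


def reranking_commands(data_dir: str) -> list:
    """Build one scripts/reranking.py invocation per .jsonl file in data_dir."""
    files = sorted(Path(data_dir).glob("*.jsonl"))
    return [
        [sys.executable, "scripts/reranking.py", "--data-file", str(f)]
        for f in files
    ]
```

Each returned command can be passed to subprocess.run(cmd, check=True), which runs the files one after another and stops at the first failure.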

Step-Level Beam Search

With the beam_search data downloaded and the judge launched, beam search can be run with

python scripts/beam_search.py --input-dir [INPUT_DIR]

where [INPUT_DIR] is the path to one of the subfolders inside the beam_search data folder. The script automatically computes the performance at the end and creates a folder with files 0.jsonl to {L-1}.jsonl containing the beam search decisions for each tree. The result folder is named {judge_model}_{input_dir_name}_{beam_selection_reranking_method}_{final_selection_reranking_method}.jsonl and is placed inside the folder specified by --output-dir, which defaults to the outputs/beam_search folder. Please consult the script or run it with the -h flag for optional arguments to customize the run.
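
Because the input and result folders both use the 0.jsonl to {L-1}.jsonl convention, an interrupted run can be spotted by comparing the two. The sketch below is one way to do that; missing_tree_files is a hypothetical helper and not part of the jetts package.

```python
from pathlib import Path


def missing_tree_files(input_dir: str, output_dir: str) -> list:
    """Return the N.jsonl names present in input_dir but absent from output_dir."""
    # Only consider files whose stem is an integer index, e.g. "12.jsonl".
    expected = {p.name for p in Path(input_dir).glob("*.jsonl") if p.stem.isdigit()}
    produced = {p.name for p in Path(output_dir).glob("*.jsonl") if p.stem.isdigit()}
    # Sort numerically so that 2.jsonl comes before 10.jsonl.
    return sorted(expected - produced, key=lambda name: int(Path(name).stem))
```

An empty list means every query in the input folder has a corresponding decision file in the result folder.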

Critique-Based Refinement

In addition to downloading the reranking_and_refinement data and launching the judge, you also need to launch the generator, because refinement requires live response generation following the judge critiques. The generator can be launched in a similar manner to the judge, with

python scripts/launch_generator.py --generator-name [GENERATOR]

where [GENERATOR] is one of the short or full names in the table below, or the Hugging Face model ID of any vllm-supported model.

Short Name	Full Name
llama8b	meta-llama/Llama-3.1-8B-Instruct
llama70b	meta-llama/Llama-3.1-70B-Instruct
qwen32b	Qwen/Qwen2.5-32B-Instruct
qwen72b	Qwen/Qwen2.5-72B-Instruct

The generator will be served on localhost:8001, as expected by the refinement script, which can be executed as

python scripts/refinement.py --data-file [DATA_FILE]

where [DATA_FILE] is the path to one of the .jsonl files in the reranking_and_refinement data folder. The script does not automatically compute the performance at the end, but it does write a file containing all refined responses in the final reranking order for the dataset. The file is named {judge_model}_{refiner_model}_{data_file_name}_{final_reranking_method}.jsonl and is placed inside the folder specified by --output-dir, which defaults to the outputs/refinement folder. Please consult the script or run it with the -h flag for optional arguments to customize the run.

Evaluating Refinement Result

Note: you do not need to run manual evaluation for reranking and beam search, as we have pre-computed the scores for all model responses (including every leaf node in the search tree) and saved them with the data files. You only need to run this evaluation for refinement, as the responses are freshly generated by the generator model.

To avoid interfering with the packages needed to run the actual benchmarking, we strongly recommend running the evaluation in a dedicated conda environment. Furthermore, since BigCodeBench requires many packages in relatively old versions (e.g., numpy==1.21.2, released in August 2021), its evaluation interferes with that of the other datasets. Thus, we recommend following the steps below to create two environments.

To run evaluations for CHAMP and AlpacaEval, you need to have OPENAI_API_KEY saved as an environment variable (e.g., export OPENAI_API_KEY=[YOUR_API_KEY]). For CHAMP, you also need to keep the generator server (i.e., scripts/launch_generator.py) running at port 8001 during evaluation; the judge server can be terminated.

Note: all evaluations need to be run inside the jetts_eval directory.

Everything except for BigCodeBench

cd jetts_eval
conda create -n jetts-eval python=3.10 -y
conda activate jetts-eval
pip install uv

# You only need to run the line(s) below for the dataset(s) that you wish to evaluate
uv pip install "math-verify[antlr4_13_2]"  # for GSM8k and MATH (quoted so shells such as zsh do not expand the brackets)
uv pip install champ_dataset datasets openai  # for CHAMP
uv pip install alpaca-eval  # for AlpacaEval
uv pip install numpy absl-py langdetect nltk immutabledict  # for IFEval

# run all lines below for HumanEval+ and MBPP+
cd local_code_eval
tar xzf evalplus.tar.gz
cd evalplus
uv pip install -e .
cd ..

BigCodeBench

conda create -n jetts-eval-bcb python=3.10 -y
conda activate jetts-eval-bcb
pip install uv
cd local_code_eval
tar xzf bigcodebench.tar.gz
cd bigcodebench
uv pip install -e .
uv pip install -r requirements-eval.txt

Running the evaluation

After you have installed the necessary packages and activated the correct environment, you can run

python evaluate_refinement.py --refinement-output-file [REFINEMENT_OUTPUT_FILE]

where [REFINEMENT_OUTPUT_FILE] is the path to the output file generated by scripts/refinement.py. By default, the file name contains the dataset name. However, if you provide a custom output file whose name does not contain the dataset name, you need to additionally specify --dataset [DATASET], chosen from the list [gsm8k, math, champ, humaneval, mbpp, bigcodebench, alpacaeval, ifeval]. The program prints the score at the end.
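
To see whether a given file name would need the explicit --dataset flag, you can check it against the dataset list yourself. The sketch below uses a plain substring match, which is an assumption for illustration and not necessarily how evaluate_refinement.py detects the dataset. The name infer_dataset is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Dataset names copied from the list above.
DATASETS = ["gsm8k", "math", "champ", "humaneval", "mbpp",
            "bigcodebench", "alpacaeval", "ifeval"]


def infer_dataset(output_file: str) -> Optional[str]:
    """Return the single dataset name found in the file name, else None."""
    name = Path(output_file).name.lower()
    matches = [d for d in DATASETS if d in name]
    # Zero or several matches are ambiguous, so ask for --dataset instead.
    return matches[0] if len(matches) == 1 else None
```

A result of None suggests passing --dataset explicitly. Note that a substring match can be misled by model names that happen to contain a dataset name.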

Questions?

If you have any questions, feel free to open an issue or contact the authors at yilun.zhou@salesforce.com and austin.xu@salesforce.com.

Version: 0.0.1 (released Apr 21, 2025)
