
LiveCodeBench - packaged by NVIDIA

Project description

LiveCodeBench


Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"

🏠 Home Page · 💻 Data · 🏆 Leaderboard · 🔍 Explorer

Introduction

LiveCodeBench provides holistic and contamination-free evaluation of the coding capabilities of LLMs. It continuously collects new problems over time from contests on three competition platforms -- LeetCode, AtCoder, and CodeForces -- and covers a broader range of code-related capabilities beyond code generation, such as self-repair, code execution, and test output prediction. Currently, LiveCodeBench hosts four hundred high-quality coding problems published between May 2023 and March 2024.

Attribution

This project builds upon and extends the SciCode benchmark, a research coding benchmark curated by scientists. We acknowledge the original authors and their work in creating this foundational benchmark for evaluating language models' scientific coding capabilities. For complete attribution details, please see ATTRIBUTION.md.

Installation

You can clone the repository using the following command:

git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench

We recommend using uv for managing dependencies. You can install uv and the dependencies using the following commands:

uv venv --python 3.11
source .venv/bin/activate

uv pip install -e .

Data

We provide benchmarks for the different code capability scenarios.

Inference and Evaluation

Dataset Versions

Since LiveCodeBench is a continuously updated benchmark, we provide the following versions of the dataset:

  • release_v1: The initial release of the dataset with problems released between May 2023 and Mar 2024 containing 400 problems.
  • release_v2: The updated release of the dataset with problems released between May 2023 and May 2024 containing 511 problems.
  • release_v3: The updated release of the dataset with problems released between May 2023 and Jul 2024 containing 612 problems.
  • release_v4: The updated release of the dataset with problems released between May 2023 and Sep 2024 containing 713 problems.
  • release_v5: The updated release of the dataset with problems released between May 2023 and Jan 2025 containing 880 problems.

You can use the --release_version flag to specify the dataset version you wish to use; it defaults to release_latest. For example, the following command runs the evaluation on the release_v2 dataset. Additionally, we have introduced fine-grained release versions such as v1, v2, v1_v3, and v4_v5 for selecting specific slices of the dataset.

livecodebench --model {model_name} --scenario codegeneration --evaluate --release_version release_v2

Code Generation

We use vllm for inference with open models. By default, we use tensor_parallel_size=${num_gpus} to parallelize inference across all available GPUs; this can be configured using the --tensor_parallel_size flag as required.

For running the inference, please provide the model_name based on the ./livecodebench/lm_styles.py file. The --scenario flag (here codegeneration) specifies the evaluation scenario.

livecodebench --model {model_name} --scenario codegeneration

Additionally, the --use_cache flag can be used to cache the generated outputs, and the --continue_existing flag can be used to reuse existing dumped results. If you wish to use a model from a local path, you can additionally provide the --local_model_path flag with the path to the model. We use n=10 and temperature=0.2 for generation. Please check the ./livecodebench/runner/parser.py file for more details on the flags.

For closed API models, the --multiprocess flag can be used to parallelize queries to API servers (adjustable according to rate limits).

Evaluation

We compute pass@1 and pass@5 metrics for model evaluations. We use a modified version of the checker released with the APPS benchmark: we identified and fixed some unhandled edge cases in the original checker and additionally simplified it based on our collected dataset. To run the evaluation, add the --evaluate flag:

livecodebench --model {model_name} --scenario codegeneration --evaluate

Note that time limits can cause slight variations (< 0.5 points) in the computed pass@1 and pass@5 metrics. If you observe a significant variation in performance, lower the --num_process_evaluate flag or increase the --timeout flag. Please report particular issues caused by improper timeouts here.
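The pass@1 and pass@5 numbers follow the standard unbiased pass@k estimator popularized by the HumanEval evaluation. As a minimal sketch (not necessarily the exact code used in livecodebench), the estimator can be written as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are
    correct, passes all tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=10 samples per problem and 3 correct ones:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

With n=10 generations per problem (as in the default setup above), averaging this quantity over all problems yields the reported metric.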

Finally, to get scores over different time windows, you can use ./livecodebench/evaluation/compute_scores.py file. Particularly, you can provide --start_date and --end_date flags (using the YYYY-MM-DD format) to get scores over the specified time window. In our paper, to counter contamination in the DeepSeek models, we only report results on problems released after August 2023. You can replicate those evaluations using:

python -m livecodebench.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01

NOTE: We have pruned a large number of test cases from the original benchmark and created code_generation_lite, which is set as the default benchmark and offers similar performance estimates much faster. If you wish to use the original benchmark, please use the --not_fast flag. We are in the process of updating the leaderboard scores with this updated setting.

NOTE: V2 Update: to run the updated LiveCodeBench, please use --release_version release_v2. In addition, if you have existing results from release_v1, you can add the --continue_existing or, better, --continue_existing_with_eval flag to reuse the old completions or evaluations, respectively.

Self Repair

For running self repair, you need to provide an additional --codegen_n flag that maps to the number of codes that were generated during code generation. Additionally, the --temperature flag is used to locate the old code generation evaluation file, which must be present in the output directory.

livecodebench --model {model_name} --scenario selfrepair --codegen_n {num_codes_codegen} --n 1 # only n=1 supported

In case you have results on a smaller subset or version of the benchmark, you can use --continue_existing and --continue_existing_with_eval flags to reuse the old computations. Particularly, you can run the following command to continue from existing generated solutions.

livecodebench --model {model_name} --scenario selfrepair --evaluate --continue_existing

Note that this will only reuse the generated samples and rerun evaluations. To reuse the old evaluations, you can add the --continue_existing_with_eval flag.

Test Output Prediction

For running the test output prediction scenario you can simply run

livecodebench --model {model_name} --scenario testoutputprediction --evaluate

Code Execution

For running the code execution scenario you can simply run

livecodebench --model {model_name} --scenario codeexecution --evaluate

Additionally, we support the CoT (chain-of-thought) setting with

livecodebench --model {model_name} --scenario codeexecution --cot_code_execution --evaluate

Custom Evaluation

Alternatively, you can use livecodebench/runner/custom_evaluator.py to directly evaluate model generations stored in a custom file. The file should contain a list of model outputs, appropriately formatted for evaluation, in the order of the benchmark problems.

python -m livecodebench.runner.custom_evaluator --custom_output_file {path_to_custom_outputs}

Particularly, arrange the outputs in the following format:

[
    {"question_id": "id1", "code_list": ["code1", "code2"]},
    {"question_id": "id2", "code_list": ["code1", "code2"]}
]
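As a sketch, a file in this format can be produced from a mapping of generations. The question_id values and codes below are placeholders; in practice they must match the benchmark problems being evaluated, in order:

```python
import json

# Placeholder generations keyed by problem id (order matters: it must
# follow the order of the benchmark problems).
generations = {
    "id1": ["code1", "code2"],
    "id2": ["code1", "code2"],
}

# Convert to the list-of-dicts layout expected by custom_evaluator.py.
custom_outputs = [
    {"question_id": qid, "code_list": codes}
    for qid, codes in generations.items()
]

with open("custom_outputs.json", "w") as f:
    json.dump(custom_outputs, f, indent=2)
```

The resulting custom_outputs.json can then be passed via --custom_output_file as shown above.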

Adding Support for New Models

To add support for new models, we have implemented an extensible framework to add new models and customize prompts appropriately.

Step 1: Add a new model to the ./livecodebench/lm_styles.py file. Particularly, extend the LMStyle class to add a new model family and add the model to the LanguageModelList array.

Step 2: Since we use instruction tuned models, we allow configuring the instruction for each model. Modify the ./livecodebench/prompts/generation.py file to add a new prompt for the model in the format_prompt_generation function. For example, the prompt for DeepSeekCodeInstruct family of models looks as follows

# ./livecodebench/prompts/generation.py
if LanguageModelStyle == LMStyle.DeepSeekCodeInstruct:
    prompt = f"{PromptConstants.SYSTEM_MESSAGE_DEEPSEEK}\n\n"
    prompt += f"{get_deepseekcode_question_template_answer(question)}"
    return prompt
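Step 1 can be pictured with the following self-contained sketch. The real LMStyle and model-list definitions live in ./livecodebench/lm_styles.py and their exact fields may differ; the model and family names below are hypothetical:

```python
# Illustrative sketch only -- mirror the actual definitions in
# ./livecodebench/lm_styles.py when adding a real model.
from dataclasses import dataclass
from enum import Enum

class LMStyle(Enum):
    DeepSeekCodeInstruct = "DeepSeekCodeInstruct"
    MyNewModelFamily = "MyNewModelFamily"  # hypothetical new family

@dataclass
class LanguageModel:
    model_name: str       # identifier passed via --model
    model_style: LMStyle  # selects the prompt branch in prompts/generation.py

LanguageModelList = [
    LanguageModel("my-org/my-new-model", LMStyle.MyNewModelFamily),
]
```

After registering the model family, Step 2 above adds the matching prompt branch keyed on the new LMStyle member.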

Submit Models to Leaderboard

We are currently accepting submissions only for the code generation scenario. To submit a model, copy your model generations folder from output to the submissions folder and create a pull request. We will review the submission and add the model to the leaderboard accordingly.

ERRATA

We maintain a list of known issues and updates in the ERRATA.md file. Particularly, we document issues regarding erroneous tests and problems not amenable to autograding. We are constantly using this feedback to improve our problem selection heuristics as we update LiveCodeBench.

Results

LiveCodeBench can be used to evaluate the performance of LLMs on different time windows (using problem release dates to filter the evaluation problems). Thus we can detect and prevent potential contamination in the evaluation process and evaluate LLMs on new problems.

Figures: Code Generation Live Evaluation · Test Output Prediction Live Evaluation

Next, we evaluate models on different code capabilities and find that the relative performance of models changes across tasks (left figure). This highlights the need for holistic evaluation of LLMs for code.

Figures: Holistic Tasks Evaluation · Comparing LCB vs HumanEval

We also find evidence of possible overfitting on HumanEval (right). Particularly, models that perform well on HumanEval do not necessarily perform well on LiveCodeBench. In the scatterplot above, we find the models get clustered into two groups, shaded in red and green. The red group contains models that perform well on HumanEval but poorly on LiveCodeBench, while the green group contains models that perform well on both.

For more details, please refer to our website at livecodebench.github.io.

NVIDIA NeMo Evaluator

LiveCodeBench provides evaluation clients specifically built to evaluate model endpoints using our Standard API for code generation, code execution, and test output prediction tasks.

Launching an evaluation for an LLM

Install the package

pip install nvidia-livecodebench

(Optional) Set a token for your API endpoint if it is protected

export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"

List the available evaluations

nemo-evaluator ls

Available tasks:

  • codegeneration_release_latest
  • codegeneration_release_v1
  • codegeneration_release_v2
  • codegeneration_release_v3
  • codegeneration_release_v4
  • codegeneration_release_v5
  • codegeneration_release_v6
  • codegeneration_notfast
  • testoutputprediction
  • codeexecution_v2
  • codeexecution_v2_cot
  • AA_code_generation
  • nemo_code_generation

Run the evaluation of your choice

nemo-evaluator run_eval \
    --eval_type codegeneration_release_latest \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /workspace/results

Gather the results

cat /workspace/results/results.yml

Command-Line Tool

Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the LiveCodeBench evaluations:

Commands

1. List Evaluation Types
nemo-evaluator ls

Displays the evaluation types available within the harness.

2. Run an evaluation

The nemo-evaluator run_eval command executes the evaluation process. Below are the flags and their descriptions:

Required flags:

  • --eval_type <string>: The type of evaluation to perform (e.g., codegeneration_release_latest, testoutputprediction, etc.)
  • --model_id <string>: The name or identifier of the model to evaluate.
  • --model_url <url>: The API endpoint where the model is accessible.
  • --model_type <string>: The type of the model to evaluate, currently either "chat" or "completions".
  • --output_dir <directory>: The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here.

Optional flags:

  • --api_key_name <string>: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
  • --run_config <path>: Specifies the path to a YAML file containing the evaluation definition.
  • --overrides <string>: Override configuration parameters (e.g., 'config.params.limit_samples=10').

Examples

Basic Code Generation Evaluation
nemo-evaluator run_eval \
    --eval_type codegeneration_release_latest \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
Code Generation with Authentication

If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:

export MY_API_KEY="your_api_key_here"

nemo-evaluator run_eval \
    --eval_type codegeneration_release_latest \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
Test Output Prediction
nemo-evaluator run_eval \
    --eval_type testoutputprediction \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./test_output_results
Code Execution with Chain-of-Thought
nemo-evaluator run_eval \
    --eval_type codeexecution_v2_cot \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./code_execution_results
Limited Sample Evaluation
nemo-evaluator run_eval \
    --eval_type codegeneration_release_latest \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --overrides 'config.params.limit_samples=10'

Configuring evaluations via YAML

Evaluations in LiveCodeBench are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.

Example of a YAML config:

config:
  type: codegeneration_release_latest
  params:
    parallelism: 10
    limit_samples: 20
    max_new_tokens: 4096
    temperature: 0.0
    top_p: 0.00001
    extra:
      n_samples: 10
      num_process_evaluate: 32
      cache_batch_size: 10
      release_version: release_latest
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: MY_API_KEY

The priority of overrides, from highest to lowest, is as follows:

  1. command line arguments
  2. user config (as seen above)
  3. task defaults (defined per task type)
  4. framework defaults
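This precedence can be pictured as a left-to-right deep merge in which later (higher-priority) sources win. The following is an illustrative sketch, not the actual nemo-evaluator implementation, and the parameter values are placeholders:

```python
def merge_configs(*sources: dict) -> dict:
    """Deep-merge config dicts left to right; later sources take precedence."""
    merged: dict = {}
    for src in sources:
        for key, value in src.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)  # recurse into nested sections
            else:
                merged[key] = value
    return merged

# Lowest priority first, so higher-priority sources overwrite earlier ones.
framework_defaults = {"params": {"temperature": 0.2, "max_new_tokens": 2048}}
task_defaults      = {"params": {"max_new_tokens": 4096}}
user_config        = {"params": {"limit_samples": 20}}
cli_overrides      = {"params": {"limit_samples": 10}}

final = merge_configs(framework_defaults, task_defaults, user_config, cli_overrides)
print(final["params"]["limit_samples"])  # 10 -- the CLI override wins
```

Here the command-line override of limit_samples beats the user config, while untouched keys fall through to the task and framework defaults.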

The --dry_run option allows you to print the final run configuration and command without executing the evaluation.

Example:

nemo-evaluator run_eval \
    --eval_type codegeneration_release_latest \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --dry_run

Evaluation Types

Code Generation Tasks

  • codegeneration_release_latest: Latest version of the code generation benchmark
  • codegeneration_release_v1: Initial release (400 problems, May 2023 - Mar 2024)
  • codegeneration_release_v2: Updated release (511 problems, May 2023 - May 2024)
  • codegeneration_release_v3: Updated release (612 problems, May 2023 - Jul 2024)
  • codegeneration_release_v4: Updated release (713 problems, May 2023 - Sep 2024)
  • codegeneration_release_v5: Updated release (880 problems, May 2023 - Jan 2025)
  • codegeneration_release_v6: Updated release (1055 problems, May 2023 - Apr 2025)
  • codegeneration_notfast: Full test suite version (slower but more comprehensive)

Test Output Prediction

  • testoutputprediction: Predict test outputs given problem descriptions and inputs

Code Execution Tasks

  • codeexecution_v2: Execute code on given inputs
  • codeexecution_v2_cot: Code execution with Chain-of-Thought reasoning

Specialized Evaluations

  • AA_code_generation: Code generation on specific date range (Jul 2024 - Jan 2025)
  • nemo_code_generation: Code generation on specific date range (Aug 2024 - Feb 2025)

Performance Optimizations

LiveCodeBench includes several performance optimizations to ensure efficient evaluation:

  • Reduced timeouts: 3 seconds per test case (vs 6 seconds originally)
  • Increased parallelization: 32 evaluation processes (vs 12 originally)
  • Early termination: Stop testing after 3 consecutive failures
  • Capped global timeouts: Maximum 20 seconds per task
  • Batch processing: Process evaluations in batches for reduced overhead

These optimizations reduce evaluation time from ~143 hours to ~1.1 hours while maintaining accuracy.
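The early-termination rule can be sketched as follows. This is an illustrative stand-in: the real evaluator executes generated code against each test case, whereas here the test outcomes are supplied as precomputed booleans:

```python
def run_tests(outcomes, max_consecutive_failures=3):
    """Run test cases in order, stopping after N consecutive failures.

    `outcomes` is an iterable of booleans (True = test case passed).
    Returns the number of test cases actually executed.
    """
    executed = 0
    consecutive_failures = 0
    for passed in outcomes:
        executed += 1
        if passed:
            consecutive_failures = 0  # reset the failure streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                break  # give up on this sample early; it has already failed
    return executed

# Three consecutive failures stop the run after 5 of 10 test cases.
print(run_tests([True, True, False, False, False, True, True, True, True, True]))  # 5
```

Since a single failing test already fails the sample, skipping the remaining cases changes only runtime, not the pass/fail verdict.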

Output Format

The evaluation results are saved in the specified output directory with the following structure:

output_dir/
├── results.yml          # Main results file
├── generations.json     # Model generations
├── generations_eval.json # Evaluation results
└── generations_eval_all.json # Detailed evaluation results

The results.yml file contains the main evaluation metrics including pass@1, pass@5, and other relevant scores.
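If you want to post-process scores programmatically and PyYAML is not installed, a flat results.yml of simple `key: value` pairs can be read with the standard library alone. Note this is a sketch: the metric key names and the flat layout below are assumptions rather than a documented schema, and for real files yaml.safe_load from PyYAML is the safer choice:

```python
def read_flat_yaml(text: str) -> dict:
    """Parse top-level `key: value` pairs (no nesting or lists).

    A minimal stdlib stand-in for yaml.safe_load on a flat results file.
    """
    results = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        try:
            results[key.strip()] = float(value)  # metrics become floats
        except ValueError:
            results[key.strip()] = value  # keep non-numeric values as strings
    return results

sample = "pass@1: 36.5\npass@5: 52.0\n"
scores = read_flat_yaml(sample)
print(scores["pass@1"])  # 36.5
```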

Citation

@article{jain2024livecodebench,
  author    = {Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
  title     = {LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},
  year      = {2024},
  journal   = {arXiv preprint},
}



Download files

Built Distribution

nvidia_livecodebench-26.3-py3-none-any.whl (92.8 kB, Python 3)

No source distribution files are available for this release.

Hashes for nvidia_livecodebench-26.3-py3-none-any.whl:

  • SHA256: d7d7cd2a5c3123b172ad138130abb53da9a81e8e47fb1e38ab66277030456fa8
  • MD5: 06d924b1d395e4bef43d85fdc3969eb8
  • BLAKE2b-256: 5a3d6fa1107ce90f550cf7089ad42ee86836479ea7fde4c005886c4ce980860a
