
SciCode - packaged by NVIDIA

Project description

SciCode

Homepage | Paper

NVIDIA NeMo Evaluator

SciCode provides evaluation clients specifically built to evaluate model endpoints using our Standard API for scientific code generation tasks.

Launching an evaluation for an LLM

Install the package

pip install nvidia-scicode

(Optional) Set a token for your API endpoint if it is protected

export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"

List the available evaluations

nemo-evaluator ls

Available tasks:

  • scicode
  • scicode_background
  • aa_scicode

Run the evaluation of your choice

nemo-evaluator run_eval \
    --eval_type aa_scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /workspace/results

Gather the results

cat /workspace/results/results.yml

Command-Line Tool

Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the SciCode evaluations:

Commands

1. List Evaluation Types
nemo-evaluator ls

Displays the evaluation types available within the harness.

2. Run an evaluation

The nemo-evaluator run_eval command executes the evaluation process. Below are the flags and their descriptions:

Required flags:

  • --eval_type <string>: The type of evaluation to perform (e.g., scicode, scicode_background, aa_scicode)
  • --model_id <string>: The name or identifier of the model to evaluate.
  • --model_url <url>: The API endpoint where the model is accessible.
  • --model_type <string>: The type of the model to evaluate, currently either "chat", "completions", or "vlm".
  • --output_dir <directory>: The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here.

Optional flags:

  • --api_key_name <string>: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
  • --run_config <path>: Specifies the path to a YAML file containing the evaluation definition.
  • --overrides <string>: Override configuration parameters (e.g., 'config.params.limit_samples=10').
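The dotted-path syntax accepted by --overrides can be illustrated with a small parser sketch. This is a hypothetical re-implementation for illustration only; the actual nemo-evaluator parsing logic may differ:

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one dotted-path override such as 'config.params.limit_samples=10'."""
    path, _, raw_value = override.partition("=")
    keys = path.split(".")
    node = config
    # Walk (or create) intermediate dicts down to the parent of the target key.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Interpret the value as int/float/bool where possible, else keep the string.
    value: object = raw_value
    for cast in (int, float):
        try:
            value = cast(raw_value)
            break
        except ValueError:
            pass
    if raw_value.lower() in ("true", "false"):
        value = raw_value.lower() == "true"
    node[keys[-1]] = value

cfg = {"config": {"params": {}}}
apply_override(cfg, "config.params.limit_samples=10")
print(cfg["config"]["params"]["limit_samples"])  # → 10
```

The same dotted-path idea extends to nested keys such as config.params.extra.with_background, mirroring the structure of the YAML configuration shown later in this document.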

Examples

Basic SciCode Evaluation
nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
SciCode with Background Information
nemo-evaluator run_eval \
    --eval_type scicode_background \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
AA-SciCode Evaluation (Artificial Analysis Style)
nemo-evaluator run_eval \
    --eval_type aa_scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
SciCode with Authentication

If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:

export MY_API_KEY="your_api_key_here"

nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
Limited Sample Evaluation
nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --overrides 'config.params.limit_samples=10'

Configuring evaluations via YAML

Evaluations in SciCode are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, which ensures consistency across evaluations.

Example of a YAML config:

config:
  type: scicode
  params:
    parallelism: 10
    limit_samples: 20
    max_new_tokens: 2048
    temperature: 0.0
    top_p: 0.00001
    request_timeout: 60
    max_retries: 2
    extra:
      n_samples: 1
      with_background: false
      include_dev: false
      eval_threads: null
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: MY_API_KEY

The priority of overrides is as follows:

  1. command line arguments
  2. user config (as seen above)
  3. task defaults (defined per task type)
  4. framework defaults
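This layering can be pictured as a deep merge applied from the lowest-priority layer to the highest, with later layers winning. The sketch below is a simplified illustration, not the actual implementation; the parameter values are borrowed from the config example above:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return base updated with override; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

framework_defaults = {"params": {"temperature": 0.0, "parallelism": 1}}
task_defaults = {"params": {"parallelism": 10}}
user_config = {"params": {"limit_samples": 20}}
cli_args = {"params": {"limit_samples": 10}}

# Apply layers from lowest to highest priority; later layers win.
final = framework_defaults
for layer in (task_defaults, user_config, cli_args):
    final = deep_merge(final, layer)
print(final["params"])  # → {'temperature': 0.0, 'parallelism': 10, 'limit_samples': 10}
```

Here the command-line value limit_samples=10 overrides the user config's 20, while untouched framework defaults such as temperature survive.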

The --dry_run option allows you to print the final run configuration and command without executing the evaluation.

Example:

nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --dry_run

Evaluation Types

SciCode Tasks

  • scicode: Standard SciCode evaluation without scientist-annotated background
  • scicode_background: SciCode evaluation with scientist-annotated background in the prompts
  • aa_scicode: Artificial Analysis style evaluation with background, dev set inclusion, and 3 samples per problem

Task Descriptions

SciCode

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant does not include scientist-annotated background in the prompts.

SciCode-Background

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant includes scientist-annotated background in the prompts.

AA-SciCode

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant mimics setup used by Artificial Analysis in their Intelligence Benchmark (v2).
  • It includes scientist-annotated background in the prompts and uses all available problems for evaluation (including "dev" set).

Performance Optimizations

SciCode includes several performance optimizations to ensure efficient evaluation:

  • Configurable retries: Adjustable max_retries parameter for API request resilience
  • Parallel evaluation: Configurable eval_threads for concurrent test execution
  • Flexible sampling: Multiple samples per problem for robust evaluation
  • Timeout management: Configurable request_timeout for API calls
  • Background inclusion: Optional scientist-annotated background for enhanced context

Output Format

The evaluation results are saved in the specified output directory with the following structure:

output_dir/
├── results.yml          # Main results file
├── generated_code/      # Generated code files
├── prompt/             # Prompt files
└── logs/               # Evaluation logs

The results.yml file contains the main evaluation metrics including:

  • problems_pass@1: Problem-level pass@1 score
  • steps_pass@1: Step-level pass@1 score
  • n_samples: Number of samples used
  • num_correct: Detailed correctness information per problem and step
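As an illustration of how these metrics relate, here is a sketch of computing problem- and step-level pass@1 from per-step correctness, assuming the SciCode convention that a problem counts as solved only when every one of its steps passes. The function and data names are hypothetical, not part of the package API:

```python
def pass_rates(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Compute (problems_pass@1, steps_pass@1) from per-step correctness.

    `results` maps a problem id to the pass/fail outcome of each of its steps;
    a problem counts as solved only if every step passes.
    """
    total_steps = sum(len(steps) for steps in results.values())
    passed_steps = sum(sum(steps) for steps in results.values())
    passed_problems = sum(all(steps) for steps in results.values())
    return passed_problems / len(results), passed_steps / total_steps

results = {
    "prob_1": [True, True, True],   # all steps pass -> problem solved
    "prob_2": [True, False, True],  # one step fails -> problem unsolved
}
problems_pass, steps_pass = pass_rates(results)
print(problems_pass, round(steps_pass, 3))  # → 0.5 0.833
```

This also shows why step-level scores in the leaderboard below are consistently higher than problem-level scores: partial credit on steps does not count toward solving a problem.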

This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists"

🔔News

[2025-02-01]: Results for DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini are added.

[2025-01-24]: SciCode has been integrated with inspect_ai for easier and faster model evaluations.

[2024-11-04]: Leaderboard is on! Check here. We have also added Claude Sonnet 3.5 (new) results.

[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.

[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.

[2024-08-22]: The SciCode benchmark has been successfully integrated into OpenCompass.

[2024-07-24]: We added the scientist-annotated background and support for the with-background evaluation setup.

Introduction

SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers 16 subdomains from 6 main fields, including Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information as well as scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only 7.7% of the problems in the most realistic setting. Broadly, SciCode captures the everyday workflow of scientists: identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress toward becoming helpful assistants for scientists but also helps shed light on the future development and evaluation of scientific AI.

Dataset Creation

SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on:

  • Numerical methods
  • Simulation of systems
  • Scientific calculation

These are the tasks we believe require intense scientific knowledge and reasoning to optimally test an LM's science capability.

🏆 Leaderboard

| Model | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
|---|---|---|
| 🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
| 🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
| 🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
| OpenAI o1-preview | 7.7 | 28.5 |
| Deepseek-R1 | 4.6 | 28.5 |
| Claude3.5-Sonnet | 4.6 | 26.0 |
| Claude3.5-Sonnet (new) | 4.6 | 25.3 |
| Deepseek-v3 | 3.1 | 23.7 |
| Deepseek-Coder-v2 | 3.1 | 21.2 |
| GPT-4o | 1.5 | 25.0 |
| GPT-4-Turbo | 1.5 | 22.9 |
| OpenAI o1-mini | 1.5 | 22.2 |
| Gemini 1.5 Pro | 1.5 | 21.9 |
| Claude3-Opus | 1.5 | 21.5 |
| Llama-3.1-405B-Chat | 1.5 | 19.8 |
| Claude3-Sonnet | 1.5 | 17.0 |
| Qwen2-72B-Instruct | 1.5 | 17.0 |
| Llama-3.1-70B-Chat | 0.0 | 17.0 |
| Mixtral-8x22B-Instruct | 0.0 | 16.3 |
| Llama-3-70B-Chat | 0.0 | 14.6 |

Instructions to evaluate a new model

  1. Clone this repository: git clone git@github.com:scicode-bench/SciCode.git
  2. Install the scicode package with pip install -e .
  3. Download the numeric test results and save them as ./eval/data/test_data.h5
  4. Run eval/scripts/gencode_json.py to generate new model outputs (see the eval/scripts readme for more information)
  5. Run eval/scripts/test_generated_code.py to evaluate the generated code against the unit tests

Instructions to evaluate a new model using inspect_ai (recommended)

SciCode has been integrated with inspect_ai for easier and faster model evaluation compared with the method above. Run the first three steps in the previous section, then go to the eval/inspect_ai directory, set up the corresponding API key, and run the following command:

cd eval/inspect_ai
export OPENAI_API_KEY=your-openai-api-key
inspect eval scicode.py --model openai/gpt-4o --temperature 0

For more detailed information on using inspect_ai, see the eval/inspect_ai readme.

More information and FAQ

More information, including a FAQ section, is provided on our website. If you have trouble reaching the website, the markdown source is available in its GitHub repository.

Attribution

This repository is a fork of the original SciCode project available at https://github.com/scicode-bench/SciCode. See ATTRIBUTION.md for complete attribution information.

Citation

@misc{tian2024scicode,
    title={SciCode: A Research Coding Benchmark Curated by Scientists},
    author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
    year={2024},
    eprint={2407.13168},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
