
SciCode - packaged by NVIDIA

Project description

SciCode

Homepage | Paper

NVIDIA NeMo Evaluator

SciCode provides evaluation clients specifically built to evaluate model endpoints using our Standard API for scientific code generation tasks.

Launching an evaluation for an LLM

Install the package

pip install nvidia-scicode

(Optional) Set a token for your API endpoint if it is protected

export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"

List the available evaluations

nemo-evaluator ls

Available tasks:

  • scicode
  • scicode_background
  • aa_scicode

Run the evaluation of your choice

nemo-evaluator run_eval \
    --eval_type aa_scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name MY_API_KEY \
    --output_dir /workspace/results

Gather the results

cat /workspace/results/results.yml

Command-Line Tool

Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the SciCode evaluations:

Commands

1. List Evaluation Types
nemo-evaluator ls

Displays the evaluation types available within the harness.

2. Run an evaluation

The nemo-evaluator run_eval command executes the evaluation process. Below are the flags and their descriptions:

Required flags:

  • --eval_type <string>: The type of evaluation to perform (e.g., scicode, scicode_background, aa_scicode)
  • --model_id <string>: The name or identifier of the model to evaluate.
  • --model_url <url>: The API endpoint where the model is accessible.
  • --model_type <string>: The type of the model to evaluate, currently either "chat", "completions", or "vlm".
  • --output_dir <directory>: The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here.

Optional flags:

  • --api_key_name <string>: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
  • --run_config <path>: Specifies the path to a YAML file containing the evaluation definition.
  • --overrides <string>: Override configuration parameters (e.g., 'config.params.limit_samples=10').
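The dotted-path syntax accepted by --overrides can be illustrated with a small parser sketch. This is a hypothetical re-implementation for illustration only; the actual nemo-evaluator parsing logic may differ:

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one dotted-path override such as 'config.params.limit_samples=10'."""
    path, _, raw_value = override.partition("=")
    keys = path.split(".")
    node = config
    # Walk (or create) intermediate dicts down to the parent of the target key.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Interpret the value as int/float/bool where possible, else keep the string.
    value: object = raw_value
    for cast in (int, float):
        try:
            value = cast(raw_value)
            break
        except ValueError:
            pass
    if raw_value.lower() in ("true", "false"):
        value = raw_value.lower() == "true"
    node[keys[-1]] = value

cfg = {"config": {"params": {}}}
apply_override(cfg, "config.params.limit_samples=10")
print(cfg["config"]["params"]["limit_samples"])  # → 10
```

The same dotted-path idea extends to nested keys such as config.params.extra.with_background, mirroring the structure of the YAML configuration shown later in this document.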

Examples

Basic SciCode Evaluation
nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
SciCode with Background Information
nemo-evaluator run_eval \
    --eval_type scicode_background \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
AA-SciCode Evaluation (Artificial Analysis Style)
nemo-evaluator run_eval \
    --eval_type aa_scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results
SciCode with Authentication

If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:

export MY_API_KEY="your_api_key_here"

nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results
Limited Sample Evaluation
nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --overrides 'config.params.limit_samples=10'

Configuring evaluations via YAML

Evaluations in SciCode are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, which ensures consistency across evaluations.

Example of a YAML config:

config:
  type: scicode
  params:
    parallelism: 10
    limit_samples: 20
    max_new_tokens: 2048
    temperature: 0.0
    top_p: 0.00001
    request_timeout: 60
    max_retries: 2
    extra:
      n_samples: 1
      with_background: false
      include_dev: false
      eval_threads: null
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: MY_API_KEY

The priority of overrides is as follows:

  1. command line arguments
  2. user config (as seen above)
  3. task defaults (defined per task type)
  4. framework defaults
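This layering can be pictured as a deep merge applied from the lowest-priority layer to the highest, with later layers winning. The sketch below is a simplified illustration, not the actual implementation; the parameter values are borrowed from the config example above:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return base updated with override; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

framework_defaults = {"params": {"temperature": 0.0, "parallelism": 1}}
task_defaults = {"params": {"parallelism": 10}}
user_config = {"params": {"limit_samples": 20}}
cli_args = {"params": {"limit_samples": 10}}

# Apply layers from lowest to highest priority; later layers win.
final = framework_defaults
for layer in (task_defaults, user_config, cli_args):
    final = deep_merge(final, layer)
print(final["params"])  # → {'temperature': 0.0, 'parallelism': 10, 'limit_samples': 10}
```

Here the command-line value limit_samples=10 overrides the user config's 20, while untouched framework defaults such as temperature survive.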

The --dry_run option allows you to print the final run configuration and command without executing the evaluation.

Example:

nemo-evaluator run_eval \
    --eval_type scicode \
    --model_id meta/llama-3.1-8b-instruct \
    --model_type chat \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --output_dir ./evaluation_results \
    --dry_run

Evaluation Types

SciCode Tasks

  • scicode: Standard SciCode evaluation without scientist-annotated background
  • scicode_background: SciCode evaluation with scientist-annotated background in the prompts
  • aa_scicode: Artificial Analysis style evaluation with background, dev set inclusion, and 3 samples per problem

Task Descriptions

SciCode

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant does not include scientist-annotated background in the prompts.

SciCode-Background

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant includes scientist-annotated background in the prompts.

AA-SciCode

  • SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
  • This variant mimics setup used by Artificial Analysis in their Intelligence Benchmark (v2).
  • It includes scientist-annotated background in the prompts and uses all available problems for evaluation (including "dev" set).

Performance Optimizations

SciCode includes several performance optimizations to ensure efficient evaluation:

  • Configurable retries: Adjustable max_retries parameter for API request resilience
  • Parallel evaluation: Configurable eval_threads for concurrent test execution
  • Flexible sampling: Multiple samples per problem for robust evaluation
  • Timeout management: Configurable request_timeout for API calls
  • Background inclusion: Optional scientist-annotated background for enhanced context

Output Format

The evaluation results are saved in the specified output directory with the following structure:

output_dir/
├── results.yml          # Main results file
├── generated_code/      # Generated code files
├── prompt/             # Prompt files
└── logs/               # Evaluation logs

The results.yml file contains the main evaluation metrics including:

  • problems_pass@1: Problem-level pass@1 score
  • steps_pass@1: Step-level pass@1 score
  • n_samples: Number of samples used
  • num_correct: Detailed correctness information per problem and step
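As an illustration of how these metrics relate, here is a sketch of computing problem- and step-level pass@1 from per-step correctness, assuming the SciCode convention that a problem counts as solved only when every one of its steps passes. The function and data names are hypothetical, not part of the package API:

```python
def pass_rates(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Compute (problems_pass@1, steps_pass@1) from per-step correctness.

    `results` maps a problem id to the pass/fail outcome of each of its steps;
    a problem counts as solved only if every step passes.
    """
    total_steps = sum(len(steps) for steps in results.values())
    passed_steps = sum(sum(steps) for steps in results.values())
    passed_problems = sum(all(steps) for steps in results.values())
    return passed_problems / len(results), passed_steps / total_steps

results = {
    "prob_1": [True, True, True],   # all steps pass -> problem solved
    "prob_2": [True, False, True],  # one step fails -> problem unsolved
}
problems_pass, steps_pass = pass_rates(results)
print(problems_pass, round(steps_pass, 3))  # → 0.5 0.833
```

This also shows why step-level scores in the leaderboard below are consistently higher than problem-level scores: partial credit on steps does not count toward solving a problem.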

This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists"

🔔News

[2025-02-01]: Results for DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini are added.

[2025-01-24]: SciCode has been integrated with inspect_ai for easier and faster model evaluations.

[2024-11-04]: Leaderboard is on! Check here. We have also added Claude Sonnet 3.5 (new) results.

[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.

[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.

[2024-08-22]: The SciCode benchmark has been successfully integrated into OpenCompass.

[2024-07-24]: We added the scientist-annotated background and support for the with-background evaluation setup.

Introduction

SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers 16 subdomains from 6 main fields, including Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information as well as scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only 7.7% of the problems in the most realistic setting. Broadly, SciCode captures the everyday workflow of scientists: identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress toward becoming helpful assistants for scientists but also helps shed light on the future development and evaluation of scientific AI.

Dataset Creation

SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on:

  • Numerical methods
  • Simulation of systems
  • Scientific calculation

These are the tasks we believe require intense scientific knowledge and reasoning to optimally test an LM's science capability.

🏆 Leaderboard

| Model | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
|---|---|---|
| 🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
| 🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
| 🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
| OpenAI o1-preview | 7.7 | 28.5 |
| Deepseek-R1 | 4.6 | 28.5 |
| Claude3.5-Sonnet | 4.6 | 26.0 |
| Claude3.5-Sonnet (new) | 4.6 | 25.3 |
| Deepseek-v3 | 3.1 | 23.7 |
| Deepseek-Coder-v2 | 3.1 | 21.2 |
| GPT-4o | 1.5 | 25.0 |
| GPT-4-Turbo | 1.5 | 22.9 |
| OpenAI o1-mini | 1.5 | 22.2 |
| Gemini 1.5 Pro | 1.5 | 21.9 |
| Claude3-Opus | 1.5 | 21.5 |
| Llama-3.1-405B-Chat | 1.5 | 19.8 |
| Claude3-Sonnet | 1.5 | 17.0 |
| Qwen2-72B-Instruct | 1.5 | 17.0 |
| Llama-3.1-70B-Chat | 0.0 | 17.0 |
| Mixtral-8x22B-Instruct | 0.0 | 16.3 |
| Llama-3-70B-Chat | 0.0 | 14.6 |

Instructions to evaluate a new model

  1. Clone this repository: git clone git@github.com:scicode-bench/SciCode.git
  2. Install the scicode package with pip install -e .
  3. Download the numeric test results and save them as ./eval/data/test_data.h5
  4. Run eval/scripts/gencode_json.py to generate new model outputs (see the eval/scripts readme for more information)
  5. Run eval/scripts/test_generated_code.py to evaluate the generated code against the unit tests

Instructions to evaluate a new model using inspect_ai (recommended)

SciCode has been integrated with inspect_ai for easier and faster model evaluation compared with the method above. Run the first three steps in the previous section, then go to the eval/inspect_ai directory, set up the corresponding API key, and run the following command:

cd eval/inspect_ai
export OPENAI_API_KEY=your-openai-api-key
inspect eval scicode.py --model openai/gpt-4o --temperature 0

For more detailed information on using inspect_ai, see the eval/inspect_ai readme.

More information and FAQ

More information, including a FAQ section, is provided on our website. If you have trouble reaching the website, the markdown source is available in its GitHub repository.

Attribution

This repository is a fork of the original SciCode project available at https://github.com/scicode-bench/SciCode. See ATTRIBUTION.md for complete attribution information.

Citation

@misc{tian2024scicode,
    title={SciCode: A Research Coding Benchmark Curated by Scientists},
    author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
    year={2024},
    eprint={2407.13168},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
