SciCode - packaged by NVIDIA
NVIDIA Eval Factory
SciCode provides evaluation clients built specifically to evaluate model endpoints on scientific code generation tasks using our Standard API.
Launching an evaluation for an LLM
Install the package
pip install nvidia-scicode
(Optional) Set a token for your API endpoint if it is protected
export MY_API_KEY="your_api_key_here"
export HF_TOKEN="your_huggingface_token_here"
List the available evaluations
eval-factory ls
Available tasks:
- scicode
- scicode_background
- aa_scicode
Run the evaluation of your choice
eval-factory run_eval \
--eval_type aa_scicode \
--model_id meta/llama-3.1-8b-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type chat \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
Gather the results
cat /workspace/results/results.yml
Command-Line Tool
Each package comes pre-installed with a set of command-line tools, designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the SciCode evaluations:
Commands
1. List Evaluation Types
eval-factory ls
Displays the evaluation types available within the harness.
2. Run an evaluation
The eval-factory run_eval command executes the evaluation process. Below are the flags and their descriptions:
Required flags:
- `--eval_type <string>`: The type of evaluation to perform (e.g., `scicode`, `scicode_background`, `aa_scicode`).
- `--model_id <string>`: The name or identifier of the model to evaluate.
- `--model_url <url>`: The API endpoint where the model is accessible.
- `--model_type <string>`: The type of the model to evaluate; currently one of `chat`, `completions`, or `vlm`.
- `--output_dir <directory>`: The directory to use as the working directory for the evaluation. The results, including the `results.yml` output file, are saved here.
Optional flags:
- `--api_key_name <string>`: The name of the environment variable that stores the Bearer token for the API, if authentication is required.
- `--run_config <path>`: The path to a YAML file containing the evaluation definition.
- `--overrides <string>`: Override configuration parameters (e.g., `'config.params.limit_samples=10'`).
Examples
Basic SciCode Evaluation
eval-factory run_eval \
--eval_type scicode \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results
SciCode with Background Information
eval-factory run_eval \
--eval_type scicode_background \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results
AA-SciCode Evaluation (Artificial Analysis Style)
eval-factory run_eval \
--eval_type aa_scicode \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results
SciCode with Authentication
If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:
export MY_API_KEY="your_api_key_here"
eval-factory run_eval \
--eval_type scicode \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--api_key_name MY_API_KEY \
--output_dir ./evaluation_results
Limited Sample Evaluation
eval-factory run_eval \
--eval_type scicode \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results \
--overrides 'config.params.limit_samples=10'
Configuring evaluations via YAML
Evaluations in SciCode are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API which ensures consistency across evaluations.
Example of a YAML config:
config:
type: scicode
params:
parallelism: 10
limit_samples: 20
max_new_tokens: 2048
temperature: 0.0
top_p: 0.00001
request_timeout: 60
max_retries: 2
extra:
n_samples: 1
with_background: false
include_dev: false
eval_threads: null
target:
api_endpoint:
model_id: meta/llama-3.1-8b-instruct
type: chat
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key: MY_API_KEY
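If the configuration above is saved to a file (here a hypothetical path, config.yml), it can be passed via the --run_config flag instead of repeating the individual model flags:

```shell
# config.yml is a hypothetical path to a YAML file with the structure shown above
eval-factory run_eval \
  --run_config config.yml \
  --output_dir ./evaluation_results
```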
The priority of overrides is as follows:
- command line arguments
- user config (as seen above)
- task defaults (defined per task type)
- framework defaults
The --dry_run option allows you to print the final run configuration and command without executing the evaluation.
Example:
eval-factory run_eval \
--eval_type scicode \
--model_id meta/llama-3.1-8b-instruct \
--model_type chat \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--output_dir ./evaluation_results \
--dry_run
Evaluation Types
SciCode Tasks
- scicode: Standard SciCode evaluation without scientist-annotated background
- scicode_background: SciCode evaluation with scientist-annotated background in the prompts
- aa_scicode: Artificial Analysis style evaluation with background, dev set inclusion, and 3 samples per problem
Task Descriptions
SciCode
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
- This variant does not include scientist-annotated background in the prompts.
SciCode-Background
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
- This variant includes scientist-annotated background in the prompts.
AA-SciCode
- SciCode is a challenging benchmark designed to evaluate the capabilities of LLMs in generating code for solving realistic scientific research problems.
- This variant mimics setup used by Artificial Analysis in their Intelligence Benchmark (v2).
- It includes scientist-annotated background in the prompts and uses all available problems for evaluation (including "dev" set).
Performance Optimizations
SciCode includes several performance optimizations to ensure efficient evaluation:
- Configurable retries: Adjustable max_retries parameter for API request resilience
- Parallel evaluation: Configurable eval_threads for concurrent test execution
- Flexible sampling: Multiple samples per problem for robust evaluation
- Timeout management: Configurable request_timeout for API calls
- Background inclusion: Optional scientist-annotated background for enhanced context
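These parameters map onto the YAML configuration shown earlier, so they can also be tuned per run via --overrides; a sketch adjusting parallelism (the key path follows the config structure above):

```shell
# Raise request parallelism for this run only; other settings keep their defaults
eval-factory run_eval \
  --eval_type scicode \
  --model_id meta/llama-3.1-8b-instruct \
  --model_type chat \
  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
  --output_dir ./evaluation_results \
  --overrides 'config.params.parallelism=20'
```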
Output Format
The evaluation results are saved in the specified output directory with the following structure:
output_dir/
├── results.yml # Main results file
├── generated_code/ # Generated code files
├── prompt/ # Prompt files
└── logs/ # Evaluation logs
The results.yml file contains the main evaluation metrics including:
- `problems_pass@1`: Problem-level pass@1 scores
- `steps_pass@1`: Step-level pass@1 scores
- `n_samples`: Number of samples used
- `num_correct`: Detailed correctness information per problem and step
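These metrics can also be consumed programmatically; a minimal, self-contained sketch using a naive flat "key: value" parser (a real results.yml may be nested, and a YAML library would normally be used; the sample values are illustrative, not real results):

```python
def parse_flat_yaml(text):
    """Naively parse a flat 'key: value' YAML subset (no nesting, no lists)."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        try:
            value = float(value) if "." in value else int(value)
        except ValueError:
            pass  # keep non-numeric values as strings
        out[key.strip()] = value
    return out

# Illustrative excerpt; not real benchmark numbers
sample = """
problems_pass@1: 0.05
steps_pass@1: 0.21
n_samples: 1
"""
metrics = parse_flat_yaml(sample)
print(metrics["n_samples"])  # → 1
```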
This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists"
🔔News
[2025-02-01]: Results for DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini are added.
[2025-01-24]: SciCode has been integrated with inspect_ai for easier and faster model evaluations.
[2024-11-04]: Leaderboard is on! Check here. We have also added Claude Sonnet 3.5 (new) results.
[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.
[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.
[2024-08-22]: The SciCode benchmark has been successfully integrated into OpenCompass.
[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.
Introduction
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers 16 diverse subdomains drawn from 6 domains, including Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only 7.7% of the problems in the most realistic setting. Broadly, SciCode reflects the everyday workflow of scientists: identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress toward becoming helpful assistants for scientists but also sheds light on the future building and evaluation of scientific AI.
Dataset Creation
SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on (1) numerical methods, (2) simulation of systems, and (3) scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test an LM's science capability.
🏆 Leaderboard
| Models | Main Problem Resolve Rate | Subproblem |
|---|---|---|
| 🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
| 🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
| 🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
| OpenAI o1-preview | 7.7 | 28.5 |
| Deepseek-R1 | 4.6 | 28.5 |
| Claude3.5-Sonnet | 4.6 | 26.0 |
| Claude3.5-Sonnet (new) | 4.6 | 25.3 |
| Deepseek-v3 | 3.1 | 23.7 |
| Deepseek-Coder-v2 | 3.1 | 21.2 |
| GPT-4o | 1.5 | 25.0 |
| GPT-4-Turbo | 1.5 | 22.9 |
| OpenAI o1-mini | 1.5 | 22.2 |
| Gemini 1.5 Pro | 1.5 | 21.9 |
| Claude3-Opus | 1.5 | 21.5 |
| Llama-3.1-405B-Chat | 1.5 | 19.8 |
| Claude3-Sonnet | 1.5 | 17.0 |
| Qwen2-72B-Instruct | 1.5 | 17.0 |
| Llama-3.1-70B-Chat | 0.0 | 17.0 |
| Mixtral-8x22B-Instruct | 0.0 | 16.3 |
| Llama-3-70B-Chat | 0.0 | 14.6 |
Instructions to evaluate a new model
1. Clone this repository: `git clone git@github.com:scicode-bench/SciCode.git`
2. Install the `scicode` package with `pip install -e .`
3. Download the numeric test results and save them as `./eval/data/test_data.h5`
4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the `eval/scripts` readme for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the unit tests
Instructions to evaluate a new model using inspect_ai (recommended)
SciCode has been integrated with inspect_ai for easier and faster model evaluation compared with the method above. Run the first three steps in the previous section, then go to the eval/inspect_ai directory, set up the corresponding API key, and run the following commands:
cd eval/inspect_ai
export OPENAI_API_KEY=your-openai-api-key
inspect eval scicode.py --model openai/gpt-4o --temperature 0
For more detailed information on using inspect_ai, see the eval/inspect_ai readme.
More information and FAQ
More information, including a FAQ section, is provided on our website. If you have trouble reaching the website, please find the markdown source in its github repository.
Attribution
This repository is a fork of the original SciCode project available at https://github.com/scicode-bench/SciCode. See ATTRIBUTION.md for complete attribution information.
Contact
- Minyang Tian: mtian8@illinois.edu
- Eliu Huerta: elihu@anl.gov
- Hao Peng: haopeng@illinois.edu
Citation
@misc{tian2024scicode,
title={SciCode: A Research Coding Benchmark Curated by Scientists},
author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
year={2024},
eprint={2407.13168},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
File details
Details for the file nvidia_scicode-25.9-py3-none-any.whl.
File metadata
- Download URL: nvidia_scicode-25.9-py3-none-any.whl
- Upload date:
- Size: 243.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ddad29083ba7cffb8b5eddcc1fa413ee76ce8f1ea68a194503aeb834fe733fb3` |
| MD5 | `c53f9876a0445333be63841df9af6457` |
| BLAKE2b-256 | `91c3054f824774ed9a50327d390b1a8fdb5eddf0009c810889054b69ef27deaa` |