Library for evaluating Large Language Models on CUDA code

Project description

compute-eval

ComputeEval: Evaluating Large Language Models for CUDA Code Generation

ComputeEval is a framework designed to generate and evaluate CUDA code from Large Language Models. It features:

A set of handcrafted CUDA programming challenges ("problem set") designed to evaluate an LLM's capability at writing reliable CUDA code
Utilities for generating multiple solutions to each challenge ("samples")
Utilities for functional correctness of generated CUDA code

ComputeEval is currently in Alpha. We plan to refine the evaluation framework and make frequent updates to the dataset with additional problems spanning all aspects of CUDA development.

Setup

Prerequisites

Python 3.10+ or above
NVIDIA GPU with CUDA Toolkit 12 or greater (for evaluation)

Installation

Install the package:

# pip
pip install .

# Poetry
poetry install

Note: If you use Poetry, version 2.0 or later is recommended.

API Keys

To query an LLM, you must first obtain an API key from the respective service.

NVIDIA NEMO (default)

To use ComputeEval with NVIDIA-hosted models, you need a key from build.nvidia.com.

Go to build.nvidia.com
Sign in with your account
Verify that you have sufficient credits to call hosted models
Navigate to the desired model and click on it
Click on Get API Key
Copy the generated API key
Export it as an environment variable:

export NEMO_API_KEY="<your-nvidia-key>"

OpenAI

Follow the instructions in the OpenAI docs, then:

export OPENAI_API_KEY="<your-openai-key>"

Anthorpic (Claude)

Follow instruction on Anthropic docs, then:

export ANTHROPIC_API_KEY="<your-anthropic-key>"

Usage

Note: This repository executes machine-generated CUDA code. While it's unlikely that the code is malicious, it could still pose potential risks. Therefore, all code execution requires the --allow-execution flag. We strongly recommend using a sandbox environment (e.g., a Docker container or virtual machine) when running generated code to minimize security risks.

ComputeEval can be configured using a YAML file that defines the parameters to the program. For example example_config_gen_samples.yaml:

problem_file: data/cuda_problems_121924.jsonl # Input problems
sample_file: data/samples.jsonl # Generated samples

model: llama-3.1-nemotron-70b-instruct # Model to use
num_samples_per_problem: 3 # Samples to generate per problem

Note: Please set NEMO_API_KEY when using a preset NIM model.

Read the problem_file: data/cuda_problems_121924.jsonl
Generate 3 completions per problem using the llama-3.1-nemotron-70b-instruct model
Write all completions to the output samples file: data/samples.jsonl

To use a custom model:

problem_file: data/problems.jsonl
sample_file: data/samples.jsonl

num_samples_per_problem: 3

custom_model:
  api_endpoint: https://integrate.api.nvidia.com/v1
  model_id: nvidia/llama-3.1-nemotron-70b-instruct

Note: Please set OPENAI_API_KEY when using a custom model.

Models Available

The models available for completions are listed below:

"mixtral-8x22b" => mistralai/mixtral-8x22b-instruct-v0.1
"gemma-2b" => google/gemma-2b
"llama3.1-8b" => meta/llama-3.1-8b-instruct
"llama3.1-70b" => meta/llama-3.1-70b-instruct
"llama3.1-405b" => meta/llama-3.1-405b-instruct
"llama3.2-1b" => meta/llama-3.2-1b-instruct
"llama3.2-3b" => meta/llama-3.2-3b-instruct
"llama3.2-90b" => meta/llama-3.2-90b-vision-instruct
"llama3.1-nemotron-70b" => nvidia/llama-3.1-nemotron-70b-instruct
"nemotron-mini-4b" => nvidia/nemotron-mini-4b-instruct
"starcoder2-7b" => bigcode/starcoder2-7b
"mistral-nemo-12b" => nv-mistralai/mistral-nemo-12b-instruct
"openai-" => nv-mistralai/mistral-nemo-12b-instruct

By default, NVIDIA hosted llama-3.1-70b-instruct is used.

Generate samples and evaluate

Generate samples based on the config file:

compute_eval generate_samples -config_file=example_config_gen_samples.yaml

Now you have a data/samples.jsonl.

To launch an evaluation on the generated samples create a config file where the content of example_config_evalcorrectness.yaml:

sample_file: data/samples.jsonl
problem_file: data/cuda_problems_121924.jsonl

k: [1, 3]

compute_eval evaluate_functional_correctness -config_file=example_config_evalcorrectness.yaml

Note: the program will ask you to allow code execution by adding the --allow-execution flag.

This will read the problems and the sample file
It will run each of the samples through a functional correctness testing suite
It will output a pass@k dictionary with 2 pass@k values for k = 1 nand k = 3

Caveats:

The k argument for evaluate_functional_correctness should be a comma-separated e.g., [1,10].
Note that if you have a list of k that you want used in evaluation, then max(k) <= num_samples_per_problem else that k value will not show up in the pass@k dict generated.

Command docs

`generate_samples`

This command generates samples for given problems using a specified model and writes them to the specified sample_file.

Arguments

problem_file (str): The path to the file containing the problems to generate samples for.
sample_file (str, optional): The path to the file where the generated samples will be written. (default: generated_samples.jsonl).
num_samples_per_problem (int, optional): The number of samples to generate per problem (default: 100).
n_workers (int, optional): The number of worker threads to use (default: 20).
system_prompt (str, optional): The system prompt to use (default: a predefined CUDA programming prompt).
max_tokens (int, optional): The maximum number of tokens for the model to generate (default: 1024).
print_completions (bool, optional): Flag to specify if you want the completions printed to stdout. (default: False)
model (str, optional): The model to use for generating samples (default: "llama3.1-70b").
model_type (str, optional): The type of model (default: "instruct").
custom_model(dict, optional): api_endpoint (base url) and model_id (model name) for any model that uses the OpenAI API. Please use the OPENAI_API_KEY to set your credentials when using a custom model.
params (dict, optional): parameters for the chat completions request - temperature, top_p, max_tokens.

`evaluate_functional_correctness`

This command evaluates the functional correctness of generated samples and outputs a pass@k dictionary

Arguments

sample_file (str): The path to the file containing the samples to be evaluated.
problem_file (str): The path to the file containing the problems to evaluate against.
k (str, optional): The list of values for k, as a comma-separated string (default: "1,10,100").
n_workers (int, optional): The number of worker threads to use (default: 4).
timeout (float, optional): The timeout for each evaluation in seconds (default: 3.0).
save_completions_dir (str, optional): Directory path where the samples will be stored as .cu files (default: "" i.e not saved)

Dataset

For more information about the dataset see DATASET_CARD.md.

Contributing

See contributing.md for development instructions.

Project details

Release history Release notifications | RSS feed

This version

26.3

Mar 16, 2026

26.1

Feb 27, 2026

25.11

Dec 4, 2025

25.10

Oct 31, 2025

25.9.1

Oct 23, 2025

25.9

Oct 17, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

nvidia_compute_eval-26.3-py3-none-any.whl (283.6 kB view details)

Uploaded Mar 16, 2026 Python 3

File details

Details for the file nvidia_compute_eval-26.3-py3-none-any.whl.

File metadata

Download URL: nvidia_compute_eval-26.3-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 283.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for nvidia_compute_eval-26.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`731be070da2ac2120a164fff7345ce22bd0d74e6c464b63e595c09847e96f1c1`
MD5	`df1ad28781503a324088baa5366c1e51`
BLAKE2b-256	`ae834a8fefe6236a5950452aff4340bd9fc92b02d67e49f511d20407b69fcf23`

See more details on using hashes here.

nvidia-compute-eval 26.3

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

compute-eval

Setup

Prerequisites

Installation

API Keys

NVIDIA NEMO (default)

OpenAI

Anthorpic (Claude)

Usage

Models Available

Generate samples and evaluate

Command docs

`generate_samples`

Arguments

`evaluate_functional_correctness`

Arguments

Dataset

Contributing

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes