Library for evaluating Large Language Models on CUDA code
Project description
compute-eval
ComputeEval: Evaluating Large Language Models for CUDA Code Generation
ComputeEval is a framework designed to generate and evaluate CUDA code from Large Language Models. It features:
- A set of handcrafted CUDA programming challenges ("problem set") designed to evaluate an LLM's capability at writing reliable CUDA code
- Utilities for generating multiple solutions to each challenge ("samples")
- Utilities for functional correctness of generated CUDA code
ComputeEval is currently in Alpha. We plan to refine the evaluation framework and make frequent updates to the dataset with additional problems spanning all aspects of CUDA development.
Setup
Prerequisites
- Python 3.10+ or above
- NVIDIA GPU with CUDA Toolkit 12 or greater (for evaluation)
Installation
Install the package:
# pip
pip install .
# Poetry
poetry install
Note: If you use Poetry, version 2.0 or later is recommended.
API Keys
To query an LLM, you must first obtain an API key from the respective service.
NVIDIA NEMO (default)
To use ComputeEval with NVIDIA-hosted models, you need a key from build.nvidia.com.
- Go to build.nvidia.com
- Sign in with your account
- Verify that you have sufficient credits to call hosted models
- Navigate to the desired model and click on it
- Click on
Get API Key - Copy the generated API key
- Export it as an environment variable:
export NEMO_API_KEY="<your-nvidia-key>"
OpenAI
Follow the instructions in the OpenAI docs, then:
export OPENAI_API_KEY="<your-openai-key>"
Anthorpic (Claude)
Follow instruction on Anthropic docs, then:
export ANTHROPIC_API_KEY="<your-anthropic-key>"
Usage
Note: This repository executes machine-generated CUDA code.
While it's unlikely that the code is malicious, it could still pose potential risks.
Therefore, all code execution requires the --allow-execution flag.
We strongly recommend using a sandbox environment (e.g., a Docker container or virtual machine) when running generated code to minimize security risks.
ComputeEval can be configured using a YAML file that defines the parameters to the program.
For example example_config_gen_samples.yaml:
problem_file: data/cuda_problems_121924.jsonl # Input problems
sample_file: data/samples.jsonl # Generated samples
model: llama-3.1-nemotron-70b-instruct # Model to use
num_samples_per_problem: 3 # Samples to generate per problem
Note: Please set NEMO_API_KEY when using a preset NIM model.
- Read the problem_file:
data/cuda_problems_121924.jsonl - Generate 3 completions per problem using the
llama-3.1-nemotron-70b-instructmodel - Write all completions to the output samples file:
data/samples.jsonl
To use a custom model:
problem_file: data/problems.jsonl
sample_file: data/samples.jsonl
num_samples_per_problem: 3
custom_model:
api_endpoint: https://integrate.api.nvidia.com/v1
model_id: nvidia/llama-3.1-nemotron-70b-instruct
Note: Please set OPENAI_API_KEY when using a custom model.
Models Available
The models available for completions are listed below:
- "mixtral-8x22b" => mistralai/mixtral-8x22b-instruct-v0.1
- "gemma-2b" => google/gemma-2b
- "llama3.1-8b" => meta/llama-3.1-8b-instruct
- "llama3.1-70b" => meta/llama-3.1-70b-instruct
- "llama3.1-405b" => meta/llama-3.1-405b-instruct
- "llama3.2-1b" => meta/llama-3.2-1b-instruct
- "llama3.2-3b" => meta/llama-3.2-3b-instruct
- "llama3.2-90b" => meta/llama-3.2-90b-vision-instruct
- "llama3.1-nemotron-70b" => nvidia/llama-3.1-nemotron-70b-instruct
- "nemotron-mini-4b" => nvidia/nemotron-mini-4b-instruct
- "starcoder2-7b" => bigcode/starcoder2-7b
- "mistral-nemo-12b" => nv-mistralai/mistral-nemo-12b-instruct
- "openai-" => nv-mistralai/mistral-nemo-12b-instruct
By default, NVIDIA hosted llama-3.1-70b-instruct is used.
Generate samples and evaluate
Generate samples based on the config file:
compute_eval generate_samples -config_file=example_config_gen_samples.yaml
Now you have a data/samples.jsonl.
To launch an evaluation on the generated samples create a config file
where the content of example_config_evalcorrectness.yaml:
sample_file: data/samples.jsonl
problem_file: data/cuda_problems_121924.jsonl
k: [1, 3]
compute_eval evaluate_functional_correctness -config_file=example_config_evalcorrectness.yaml
Note: the program will ask you to allow code execution by adding the --allow-execution flag.
- This will read the problems and the sample file
- It will run each of the samples through a functional correctness testing suite
- It will output a
pass@kdictionary with 2pass@kvalues for k = 1 nand k = 3
Caveats:
- The
kargument forevaluate_functional_correctnessshould be a comma-separated e.g.,[1,10]. - Note that if you have a list of
kthat you want used in evaluation, thenmax(k) <= num_samples_per_problemelse thatkvalue will not show up in the pass@k dict generated.
Command docs
generate_samples
This command generates samples for given problems using a specified model and writes them to the specified sample_file.
Arguments
problem_file(str): The path to the file containing the problems to generate samples for.sample_file(str, optional): The path to the file where the generated samples will be written. (default:generated_samples.jsonl).num_samples_per_problem(int, optional): The number of samples to generate per problem (default: 100).n_workers(int, optional): The number of worker threads to use (default: 20).system_prompt(str, optional): The system prompt to use (default: a predefined CUDA programming prompt).max_tokens(int, optional): The maximum number of tokens for the model to generate (default: 1024).print_completions(bool, optional): Flag to specify if you want the completions printed to stdout. (default: False)model(str, optional): The model to use for generating samples (default: "llama3.1-70b").model_type(str, optional): The type of model (default: "instruct").custom_model(dict, optional): api_endpoint (base url) and model_id (model name) for any model that uses the OpenAI API. Please use the OPENAI_API_KEY to set your credentials when using a custom model.params(dict, optional): parameters for the chat completions request - temperature, top_p, max_tokens.
evaluate_functional_correctness
This command evaluates the functional correctness of generated samples and outputs a pass@k dictionary
Arguments
sample_file(str): The path to the file containing the samples to be evaluated.problem_file(str): The path to the file containing the problems to evaluate against.k(str, optional): The list of values for k, as a comma-separated string (default: "1,10,100").n_workers(int, optional): The number of worker threads to use (default: 4).timeout(float, optional): The timeout for each evaluation in seconds (default: 3.0).save_completions_dir(str, optional): Directory path where the samples will be stored as .cu files (default: "" i.e not saved)
Dataset
For more information about the dataset see DATASET_CARD.md.
Contributing
See contributing.md for development instructions.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nvidia_compute_eval-26.3-py3-none-any.whl.
File metadata
- Download URL: nvidia_compute_eval-26.3-py3-none-any.whl
- Upload date:
- Size: 283.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
731be070da2ac2120a164fff7345ce22bd0d74e6c464b63e595c09847e96f1c1
|
|
| MD5 |
df1ad28781503a324088baa5366c1e51
|
|
| BLAKE2b-256 |
ae834a8fefe6236a5950452aff4340bd9fc92b02d67e49f511d20407b69fcf23
|