kernel-tuner

An easy to use CUDA/OpenCL kernel tuner in Python

Kernel Tuner simplifies the software development of optimized and auto-tuned GPU programs, by enabling Python-based unit testing of GPU code and making it easy to develop scripts for auto-tuning GPU kernels. This also means no extensive changes and no new dependencies are required in the kernel code. The kernels can still be compiled and used as normal from any host programming language.

Kernel Tuner provides a comprehensive solution for auto-tuning GPU programs, supporting auto-tuning of user-defined parameters in both host and device code, supporting output verification of all benchmarked kernels during tuning, as well as many optimization strategies to speed up the tuning process.

Documentation

The full documentation is available here.

Installation

The easiest way to install the Kernel Tuner is using pip:

To tune CUDA kernels:

First, make sure you have the CUDA Toolkit installed
Then type: pip install kernel_tuner[cuda]

To tune OpenCL kernels:

First, make sure you have an OpenCL compiler for your intended OpenCL platform
Then type: pip install kernel_tuner[opencl]

Or both:

pip install kernel_tuner[cuda,opencl]

More information about how to install Kernel Tuner and its dependencies can be found in the installation guide.

Example usage

The following shows a simple example for tuning a CUDA kernel:

import numpy
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
"""

size = 10000000

a = numpy.random.randn(size).astype(numpy.float32)
b = numpy.random.randn(size).astype(numpy.float32)
c = numpy.zeros_like(b)
n = numpy.int32(size)
args = [c, a, b, n]

tune_params = dict()
tune_params["block_size_x"] = [32, 64, 128, 256, 512]

tune_kernel("vector_add", kernel_string, size, args, tune_params)

The exact same Python code can be used to tune an OpenCL kernel:

kernel_string = """
__kernel void vector_add(__global float *c, __global float *a, __global float *b, int n) {
    int i = get_global_id(0);
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
"""

Kernel Tuner will detect the kernel language and select the right compiler and runtime. For every kernel configuration in the parameter space, Kernel Tuner will insert C preprocessor defines for the tunable parameters, compile, and benchmark the kernel. The timing results are printed to the console and are also returned by tune_kernel to allow further analysis. Note that this is only the default behavior; what tune_kernel does, and how, is controlled through its many optional arguments.

You can find many more extensive examples in the examples directory and in the Kernel Tuner documentation pages.
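As a rough illustration of the steps described above, the following sketch enumerates a parameter space and prepends preprocessor defines to a kernel source string. It is not Kernel Tuner's implementation: the helper names enumerate_configs and with_defines are hypothetical, and the compile and benchmark steps are left out because they need a GPU.

```python
import itertools

kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i<n) {
        c[i] = a[i] + b[i];
    }
}
"""

tune_params = {"block_size_x": [32, 64, 128, 256, 512]}


def enumerate_configs(tune_params):
    """Yield one dict per point in the Cartesian product of all parameter values."""
    names = list(tune_params)
    for values in itertools.product(*(tune_params[name] for name in names)):
        yield dict(zip(names, values))


def with_defines(source, config):
    """Prepend one '#define NAME VALUE' line per tunable parameter (hypothetical helper)."""
    defines = "".join(f"#define {name} {value}\n" for name, value in config.items())
    return defines + source


configs = list(enumerate_configs(tune_params))
first_source = with_defines(kernel_string, configs[0])
print(len(configs), "configurations")
print(first_source.splitlines()[0])
```

With a single tunable parameter the space has one configuration per listed value. Adding a second parameter multiplies the number of configurations, which is where the search strategies become useful.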

Search strategies for tuning

Kernel Tuner supports many optimization algorithms to accelerate the auto-tuning process. Currently implemented search algorithms are: Brute Force (default), Nelder-Mead, Powell, CG, BFGS, L-BFGS-B, TNC, COBYLA, SLSQP, Random Search, Basinhopping, Differential Evolution, a Genetic Algorithm, Particle Swarm Optimization, the Firefly Algorithm, Simulated Annealing, Dual Annealing, Iterative Local Search, Multi-start Local Search, and Bayesian Optimization.

Using a search strategy is easy: you only need to tell tune_kernel which strategy and method you would like to use, for example strategy="genetic_algorithm" or strategy="basinhopping". For a full overview of the supported search strategies and methods, please see the Kernel Tuner documentation on Optimization Strategies.
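To give a feel for why a search strategy can shorten tuning, the sketch below compares brute force with a simple random search on a made-up cost function. Everything here is an assumption for illustration: fake_benchmark stands in for compiling and timing a kernel, the second parameter tile_size is hypothetical, and the code does not reflect how Kernel Tuner implements its strategies.

```python
import itertools
import random

tune_params = {
    "block_size_x": [32, 64, 128, 256, 512],
    "tile_size": [1, 2, 4, 8],  # hypothetical second tunable parameter
}


def fake_benchmark(config):
    """Made-up cost function standing in for a real kernel timing (lower is better)."""
    return abs(config["block_size_x"] - 128) / 128 + abs(config["tile_size"] - 4) / 4


def all_configs(tune_params):
    """Return every point in the Cartesian product of the parameter values."""
    names = list(tune_params)
    return [dict(zip(names, values))
            for values in itertools.product(*tune_params.values())]


def brute_force(tune_params):
    """Evaluate every configuration; returns (best_config, number_of_evaluations)."""
    configs = all_configs(tune_params)
    return min(configs, key=fake_benchmark), len(configs)


def random_search(tune_params, fraction=0.25, seed=0):
    """Evaluate a random subset of the space; cheaper, but may miss the optimum."""
    configs = all_configs(tune_params)
    rng = random.Random(seed)
    sample = rng.sample(configs, max(1, int(len(configs) * fraction)))
    return min(sample, key=fake_benchmark), len(sample)


best_bf, evals_bf = brute_force(tune_params)
best_rs, evals_rs = random_search(tune_params)
print("brute force:  ", best_bf, "after", evals_bf, "evaluations")
print("random search:", best_rs, "after", evals_rs, "evaluations")
```

The random search benchmarks only a quarter of the configurations, at the risk of not finding the true optimum. With Kernel Tuner itself none of this is needed; the strategy option of tune_kernel selects a built-in implementation.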

Tuning host and kernel code

It is possible to tune combinations of tunable parameters in both host and kernel code. This enables a number of powerful things, such as tuning the number of streams for a kernel that uses CUDA Streams or OpenCL Command Queues to overlap transfers between host and device with kernel execution. This can be done in combination with tuning the parameters inside the kernel code. See the convolution_streams example code and the documentation for a detailed explanation of the Kernel Tuner Python script.

Correctness verification

Optionally, you can let Kernel Tuner verify the output of every kernel it compiles and benchmarks by passing an answer list. This list has one entry per kernel argument, in the same order as args. Each output argument holds the expected output of the kernel, and each input argument is replaced with None.

answer = [a+b, None, None, None]  # one entry per kernel argument, in the same order as args: c, a, b, n
tune_kernel("vector_add", kernel_string, size, args, tune_params, answer=answer)
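The idea behind the answer list can be sketched without a GPU. The following is a minimal, dependency-free illustration that uses plain Python lists instead of NumPy arrays; the function name verify_outputs and the tolerance atol are assumptions made for this sketch and are not Kernel Tuner's internals.

```python
import math


def verify_outputs(answer, outputs, atol=1e-6):
    """Compare kernel outputs against expected values, skipping entries that are None."""
    if len(answer) != len(outputs):
        raise ValueError("answer must have one entry per kernel argument")
    for expected, actual in zip(answer, outputs):
        if expected is None:
            continue  # input argument: nothing to check
        if len(expected) != len(actual):
            return False
        if not all(math.isclose(e, x, rel_tol=0.0, abs_tol=atol)
                   for e, x in zip(expected, actual)):
            return False
    return True


# Host-side stand-ins for the vector_add arguments c, a, b, n.
a = [1.0, 2.0, 3.0]
b = [0.5, 0.5, 0.5]
n = len(a)
c = [x + y for x, y in zip(a, b)]  # what a correct kernel would write into c

answer = [[1.5, 2.5, 3.5], None, None, None]
print(verify_outputs(answer, [c, a, b, n]))          # output matches the answer
print(verify_outputs(answer, [[0.0] * n, a, b, n]))  # output does not match
```

Only the first entry is checked here because c is the only output of vector_add; the None entries mark a, b, and n as inputs.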

Contributing

Please see the Contributions Guide.

Citation

If you use Kernel Tuner in research or research software, please cite the most relevant among the following publications:

@article{kerneltuner,
  author  = {Ben van Werkhoven},
  title   = {Kernel Tuner: A search-optimizing GPU code auto-tuner},
  journal = {Future Generation Computer Systems},
  year = {2019},
  volume  = {90},
  pages = {347-358},
  url = {https://www.sciencedirect.com/science/article/pii/S0167739X18313359},
  doi = {10.1016/j.future.2018.08.004}
}

@article{willemsen2021bayesian,
  author = {Willemsen, Floris-Jan and Van Nieuwpoort, Rob and Van Werkhoven, Ben},
  title = {Bayesian Optimization for auto-tuning GPU kernels},
  journal = {International Workshop on Performance Modeling, Benchmarking and Simulation
     of High Performance Computer Systems (PMBS) at Supercomputing (SC21)},
  year = {2021},
  url = {https://arxiv.org/abs/2111.14991}
}

@article{schoonhoven2022benchmarking,
  title={Benchmarking optimization algorithms for auto-tuning GPU kernels},
  author={Schoonhoven, Richard and van Werkhoven, Ben and Batenburg, K Joost},
  journal={IEEE Transactions on Evolutionary Computation},
  year={2022},
  publisher={IEEE},
  url = {https://arxiv.org/abs/2210.01465}
}

@article{schoonhoven2022going,
  author = {Schoonhoven, Richard and Veenboer, Bram and van Werkhoven, Ben and Batenburg, K Joost},
  title = {Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning},
  journal = {International Workshop on Performance Modeling, Benchmarking and Simulation
     of High Performance Computer Systems (PMBS) at Supercomputing (SC22)},
  year = {2022},
  url = {https://arxiv.org/abs/2211.07260}
}
