Skip to main content

User-friendly library to enhance PyCUDA functionality

Project description

PyCUDA Plus

PyCUDA Plus is an enhanced Python library built on top of PyCUDA, designed to simplify GPU programming and execution. It provides high-level abstractions and utilities for working with CUDA kernels, memory management, and context handling, allowing developers to focus on writing efficient CUDA code without dealing with low-level details.


Key Features

  • Kernel Management: Compile, load, and execute custom CUDA kernels easily with the KernelExecutor.
  • Memory Management: Simplified allocation and transfer of device and host memory using the MemoryManager.
  • Context Handling: Seamless setup and teardown of CUDA contexts with the CudaContextManager.
  • Error Checking: Built-in error detection and reporting via CudaErrorChecker.
  • Utility Functions: Prebuilt kernels, NumPy support, and grid/block configuration helpers for common operations.
  • Grid/Block Configuration: Automate grid and block size calculations for CUDA kernels using GridBlockConfig.
  • Performance Profiling: Measure execution time of CUDA kernels with PerformanceProfiler.

Installation

To install the pycuda_plus library, run:

pip install pycuda_plus

Ensure you have the following prerequisites installed:

  • CUDA Toolkit
  • PyCUDA
  • Compatible NVIDIA GPU drivers

Getting Started

Example 1: Vector Addition

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.utils.prebuilt_kernels import get_kernel
from pycuda_plus.core.context import CudaContextManager
from pycuda_plus.core.error import CudaErrorChecker

def vector_addition_example(N):
    kernel = KernelExecutor()
    memory = MemoryManager()
    context_manager = CudaContextManager()
    context_manager.initialize_context()

    try:
        A = np.random.rand(N).astype(np.float32)
        B = np.random.rand(N).astype(np.float32)
        C = np.zeros(N, dtype=np.float32)

        vector_add = get_kernel('vector_add')
        d_A = memory.to_gpu(A)
        d_B = memory.to_gpu(B)
        d_C = memory.allocate_device_array(C.shape, dtype=np.float32)

        block_size = 256
        grid_size = (N + block_size - 1) // block_size

        kernel.launch_kernel(vector_add, (grid_size, 1, 1), (block_size, 1, 1), d_A, d_B, d_C, np.int32(N))

        error_checker = CudaErrorChecker()
        error_checker.check_errors()

        memory.copy_to_host(d_C, C)
        return C
    finally:
        context_manager.finalize_context()

if __name__ == "__main__":
    N = 1000000
    result = vector_addition_example(N)
    print(f"Vector addition result (first 5 elements): {result[:5]}")

Example 2: Matrix Multiplication

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.core.context import CudaContextManager
from pycuda_plus.core.error import CudaErrorChecker

matrix_multiply_kernel = """
__global__ void matrix_multiply(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float value = 0;
        for (int k = 0; k < N; ++k) {
            value += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = value;
    }
}
"""

def matrix_multiply_example(N):
    kernel = KernelExecutor()
    memory = MemoryManager()
    context_manager = CudaContextManager()
    context_manager.initialize_context()

    try:
        A = np.random.rand(N, N).astype(np.float32)
        B = np.random.rand(N, N).astype(np.float32)
        C = np.zeros((N, N), dtype=np.float32)

        compiled_kernel = kernel.compile_kernel(matrix_multiply_kernel, 'matrix_multiply')
        d_A = memory.to_gpu(A)
        d_B = memory.to_gpu(B)
        d_C = memory.allocate_device_array(C.shape, dtype=np.float32)

        block_size = 16
        grid_size = (N + block_size - 1) // block_size

        kernel.launch_kernel(compiled_kernel, (grid_size, grid_size, 1), (block_size, block_size, 1), d_A, d_B, d_C, np.int32(N))

        error_checker = CudaErrorChecker()
        error_checker.check_errors()

        memory.copy_to_host(d_C, C)
        return C
    finally:
        context_manager.finalize_context()

if __name__ == "__main__":
    N = 512
    result = matrix_multiply_example(N)
    print(f"Matrix multiplication result (first 5x5 elements):\n{result[:5, :5]}")

Example 3: Matrix Addition with Profiling

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.core.grid_block import GridBlockConfig
from pycuda_plus.core.profiler import PerformanceProfiler
from pycuda_plus.core.context import CudaContextManager

matrix_addition_kernel = """
__global__ void matrix_add(float *A, float *B, float *C, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < rows && col < cols) {
        int idx = row * cols + col;
        C[idx] = A[idx] + B[idx];
    }
}
"""

def matrix_addition_with_profiling(rows, cols):
    kernel_executor = KernelExecutor()
    memory_manager = MemoryManager()
    grid_config = GridBlockConfig(threads_per_block=256)
    profiler = PerformanceProfiler()
    context_manager = CudaContextManager()

    context_manager.initialize_context()

    try:
        A = np.random.rand(rows, cols).astype(np.float32)
        B = np.random.rand(rows, cols).astype(np.float32)
        C = np.zeros((rows, cols), dtype=np.float32)

        d_A = memory_manager.allocate_device_array(A.shape, dtype=np.float32)
        d_B = memory_manager.allocate_device_array(B.shape, dtype=np.float32)
        d_C = memory_manager.allocate_device_array(C.shape, dtype=np.float32)
        memory_manager.copy_to_device(A, d_A)
        memory_manager.copy_to_device(B, d_B)

        compiled_kernel = kernel_executor.compile_kernel(matrix_addition_kernel, 'matrix_add')

        total_elements = rows * cols
        grid, block = grid_config.auto_config(total_elements)

        grid = (grid[0], grid[0], 1)
        block = (block[0], 1, 1)

        execution_time = profiler.profile_kernel(
            compiled_kernel, grid, block, d_A, d_B, d_C, np.int32(rows), np.int32(cols)
        )
        print(f"Matrix addition kernel execution time: {execution_time:.6f} seconds")

        memory_manager.copy_to_host(d_C, C)

        return C

    finally:
        context_manager.finalize_context()

if __name__ == "__main__":
    rows, cols = 1024, 1024
    result = matrix_addition_with_profiling(rows, cols)
    print(f"Matrix addition result (first 5x5 elements):\n{result[:5, :5]}")

API Documentation

Core Modules

  1. KernelExecutor

    • Compile and launch CUDA kernels.
    • Example:
      kernel_executor = KernelExecutor()
      compiled_kernel = kernel_executor.compile_kernel(kernel_code, kernel_name)
      kernel_executor.launch_kernel(compiled_kernel, grid, block, *args)
      
  2. MemoryManager

    • Allocate, manage, and transfer memory between host and device.
    • Example:
      memory_manager = MemoryManager()
      device_array = memory_manager.allocate_device_array(shape, dtype)
      memory_manager.copy_to_device(host_array, device_array)
      memory_manager.copy_to_host(device_array, host_array)
      
  3. CudaContextManager

    • Simplify CUDA context setup and teardown.
    • Example:
      context_manager = CudaContextManager()
      context_manager.initialize_context()
      context_manager.finalize_context()
      
  4. CudaErrorChecker

    • Check for CUDA errors during kernel execution.
    • Example:
      error_checker = CudaErrorChecker()
      error_checker.check_errors()
      
  5. GridBlockConfig

    • Automate grid and block size calculation.
    • Example:
      grid_config = GridBlockConfig(threads_per_block=256)
      grid, block = grid_config.auto_config(shape)
      print(f"Grid: {grid}, Block: {block}")
      
  6. PerformanceProfiler

    • Measure execution time of CUDA kernels.
    • Example:
      profiler = PerformanceProfiler()
      execution_time = profiler.profile_kernel(kernel, grid, block, *args)
      print(f"Kernel execution time: {execution_time:.6f} seconds")
      

Utility Modules

  • numpy_support: Convert between NumPy arrays and GPU memory.
  • prebuilt_kernels: Access commonly used CUDA kernels.
  • grid_block: Helpers for calculating grid and block dimensions.
  • profiler: Tools for profiling CUDA kernel execution.

Contributing

Contributions are welcome! Please open issues or submit pull requests on the GitHub repository.


License

PyCUDA Plus is licensed under the MIT License. See the LICENSE file for details.


Acknowledgments

Built on the foundation of PyCUDA, with additional utilities for enhanced usability and performance.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycuda_plus-0.1.4.tar.gz (13.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pycuda_plus-0.1.4-py3-none-any.whl (15.7 kB view details)

Uploaded Python 3

File details

Details for the file pycuda_plus-0.1.4.tar.gz.

File metadata

  • Download URL: pycuda_plus-0.1.4.tar.gz
  • Upload date:
  • Size: 13.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.20

File hashes

Hashes for pycuda_plus-0.1.4.tar.gz
Algorithm Hash digest
SHA256 3abc530c8bca710f8a358f93312309d4a159471cc8ed073db6f866fa5606764a
MD5 28b3c0ae2da2efe5e04d136b42e9e6e4
BLAKE2b-256 5732f5980a0a4484d9a2c362bd46acb05021c57c719d0c0bb0b6f2d39c23ad1c

See more details on using hashes here.

File details

Details for the file pycuda_plus-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: pycuda_plus-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 15.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.20

File hashes

Hashes for pycuda_plus-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 af8c393b530b0fb3807f6c78c83237f5837c2ff30ef1c3c6aa18fcfabef0865f
MD5 b1de0f53046274194fa7c7596428694a
BLAKE2b-256 35cd32a3abd2851d14522c696b409ab3d591865742094d1fbbcec80f2513590e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page