User-friendly library to enhance PyCUDA functionality

These details have not been verified by PyPI

Project links

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Software Development :: Libraries :: Python Modules

Project description

PyCUDA Plus

PyCUDA Plus is a Python library built on top of PyCUDA that simplifies GPU programming. It provides high-level abstractions for compiling and launching CUDA kernels, managing memory, and handling contexts, so developers can focus on kernel code rather than low-level setup.

Key Features

Kernel Management: Compile, load, and execute custom CUDA kernels easily with the KernelExecutor.
Memory Management: Simplified allocation and transfer of device and host memory using the MemoryManager.
Context Handling: Seamless setup and teardown of CUDA contexts with the CudaContextManager.
Error Checking: Built-in error detection and reporting via CudaErrorChecker.
Utility Functions: Prebuilt kernels and NumPy support for common operations.
Grid/Block Configuration: Automate grid and block size calculations for CUDA kernels using GridBlockConfig.
Performance Profiling: Measure execution time of CUDA kernels with PerformanceProfiler.

Installation

To install the pycuda_plus library, run:

pip install pycuda_plus

Ensure you have the following prerequisites installed:

CUDA Toolkit
PyCUDA
Compatible NVIDIA GPU drivers
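
Before installing, it can help to confirm which prerequisites are visible from the current environment. The following is a minimal sketch using only the Python standard library. It assumes the CUDA Toolkit exposes `nvcc` and the driver exposes `nvidia-smi` on the PATH, which is the usual setup but not guaranteed; `check_prerequisites` is a hypothetical helper, not part of pycuda_plus.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which prerequisites can be found from the current environment."""
    return {
        # find_spec returns None when the package is not importable.
        "pycuda": importlib.util.find_spec("pycuda") is not None,
        "pycuda_plus": importlib.util.find_spec("pycuda_plus") is not None,
        # Assumption: nvcc ships with the CUDA Toolkit, nvidia-smi with the driver.
        "nvcc": shutil.which("nvcc") is not None,
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A "missing" entry only means the tool was not found on the PATH or in the active Python environment; it may still be installed elsewhere.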

Getting Started

Example 1: Vector Addition

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.utils.prebuilt_kernels import get_kernel
from pycuda_plus.core.context import CudaContextManager
from pycuda_plus.core.error import CudaErrorChecker


def vector_addition_example(N):
    kernel = KernelExecutor()
    memory_manager = MemoryManager()  # Using the MemoryManager
    context_manager = CudaContextManager()
    context_manager.initialize_context()

    try:
        A = np.random.rand(N).astype(np.float32)
        B = np.random.rand(N).astype(np.float32)
        C = np.zeros(N, dtype=np.float32)

        vector_add = get_kernel('vector_add')

        # Allocate memory on the GPU
        d_A = memory_manager.allocate_device_array(A.shape, dtype=np.float32)
        d_B = memory_manager.allocate_device_array(B.shape, dtype=np.float32)
        d_C = memory_manager.allocate_device_array(C.shape, dtype=np.float32)

        # Copy data from host to GPU
        memory_manager.copy_to_device(A, d_A)
        memory_manager.copy_to_device(B, d_B)

        block_size = 256
        grid_size = (N + block_size - 1) // block_size

        # Launch the kernel
        kernel.launch_kernel(vector_add, (grid_size, 1, 1), (block_size, 1, 1), d_A, d_B, d_C, np.int32(N))

        error_checker = CudaErrorChecker()
        error_checker.check_errors()

        # Copy the result back to host
        memory_manager.copy_to_host(d_C, C)
        return C
    finally:
        context_manager.finalize_context()


if __name__ == "__main__":
    N = 1000000
    result = vector_addition_example(N)
    print(f"Vector addition result (first 5 elements): {result[:5]}")
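
The grid size in Example 1 is a ceiling division: it rounds up so that every element gets a thread even when N is not a multiple of the block size. The sketch below isolates that arithmetic in plain Python; `blocks_needed` is a hypothetical helper name, and the comment about the bounds check assumes the prebuilt `vector_add` kernel guards its index the way the custom kernels in the later examples do.

```python
def blocks_needed(n_elements, block_size):
    """Ceiling division: smallest block count with blocks * block_size >= n_elements."""
    return (n_elements + block_size - 1) // block_size


n = 1_000_000
block_size = 256
grid_size = blocks_needed(n, block_size)

# Threads launched beyond n are the ones a kernel's bounds check must skip.
idle_threads = grid_size * block_size - n
print(f"grid_size={grid_size}, idle_threads={idle_threads}")
```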

Example 2: Matrix Multiplication

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.core.context import CudaContextManager
from pycuda_plus.core.error import CudaErrorChecker

matrix_multiply_kernel = """
__global__ void matrix_multiply(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float value = 0;
        for (int k = 0; k < N; ++k) {
            value += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = value;
    }
}
"""

def matrix_multiply_example(N):
    kernel = KernelExecutor()
    memory_manager = MemoryManager()
    context_manager = CudaContextManager()
    context_manager.initialize_context()

    try:
        # Host arrays
        A = np.random.rand(N, N).astype(np.float32)
        B = np.random.rand(N, N).astype(np.float32)
        C = np.zeros((N, N), dtype=np.float32)

        # Compile the kernel
        compiled_kernel = kernel.compile_kernel(matrix_multiply_kernel, 'matrix_multiply')

        # Allocate memory on the device
        d_A = memory_manager.allocate_device_array(A.shape, dtype=np.float32)
        d_B = memory_manager.allocate_device_array(B.shape, dtype=np.float32)
        d_C = memory_manager.allocate_device_array(C.shape, dtype=np.float32)

        # Copy data to device
        memory_manager.copy_to_device(A, d_A)
        memory_manager.copy_to_device(B, d_B)

        # Configure grid and block sizes
        block_size = 16
        grid_size = (N + block_size - 1) // block_size

        # Launch the kernel
        kernel.launch_kernel(
            compiled_kernel,
            (grid_size, grid_size, 1),
            (block_size, block_size, 1),
            d_A, d_B, d_C, np.int32(N)
        )

        # Error checking
        error_checker = CudaErrorChecker()
        error_checker.check_errors()

        # Copy the result back to the host
        memory_manager.copy_to_host(d_C, C)
        return C

    finally:
        # Finalize the context
        context_manager.finalize_context()

if __name__ == "__main__":
    N = 512
    result = matrix_multiply_example(N)
    print(f"Matrix multiplication result (first 5x5 elements):\n{result[:5, :5]}")
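
To check a GPU result on a small input, a CPU reference that mirrors the kernel's loop and its row-major flat indexing can be useful. This is a sketch in plain Python, not part of pycuda_plus; `matrix_multiply_reference` is a hypothetical name, and it is only practical for small N because it runs in O(N^3) interpreted steps.

```python
def matrix_multiply_reference(a, b, n):
    """Multiply two n x n matrices stored as flat row-major lists."""
    c = [0.0] * (n * n)
    for row in range(n):
        for col in range(n):
            value = 0.0
            for k in range(n):
                # Same index arithmetic as the CUDA kernel above.
                value += a[row * n + k] * b[k * n + col]
            c[row * n + col] = value
    return c


a = [1.0, 2.0, 3.0, 4.0]  # [[1, 2], [3, 4]]
b = [5.0, 6.0, 7.0, 8.0]  # [[5, 6], [7, 8]]
c = matrix_multiply_reference(a, b, 2)
print(c)
```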

Example 3: Matrix Addition with Profiling

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.core.grid_block import GridBlockConfig
from pycuda_plus.core.profiler import PerformanceProfiler
from pycuda_plus.core.context import CudaContextManager

matrix_addition_kernel = """
__global__ void matrix_add(float *A, float *B, float *C, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < rows && col < cols) {
        int idx = row * cols + col;
        C[idx] = A[idx] + B[idx];
    }
}
"""

def matrix_addition_with_profiling(rows, cols):
    kernel_executor = KernelExecutor()
    memory_manager = MemoryManager()
    grid_config = GridBlockConfig(threads_per_block=256)
    profiler = PerformanceProfiler()
    context_manager = CudaContextManager()

    context_manager.initialize_context()

    try:
        A = np.random.rand(rows, cols).astype(np.float32)
        B = np.random.rand(rows, cols).astype(np.float32)
        C = np.zeros((rows, cols), dtype=np.float32)

        d_A = memory_manager.allocate_device_array(A.shape, dtype=np.float32)
        d_B = memory_manager.allocate_device_array(B.shape, dtype=np.float32)
        d_C = memory_manager.allocate_device_array(C.shape, dtype=np.float32)
        memory_manager.copy_to_device(A, d_A)
        memory_manager.copy_to_device(B, d_B)

        compiled_kernel = kernel_executor.compile_kernel(matrix_addition_kernel, 'matrix_add')

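        total_elements = rows * cols
        _grid_1d, block = grid_config.auto_config(total_elements)

        # The kernel indexes in 2D (x spans columns, y spans rows), so size the
        # grid per dimension instead of reusing the 1D block count on both axes.
        block = (block[0], 1, 1)
        grid = ((cols + block[0] - 1) // block[0], rows, 1)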

        execution_time = profiler.profile_kernel(
            compiled_kernel, grid, block, d_A, d_B, d_C, np.int32(rows), np.int32(cols)
        )
        print(f"Matrix addition kernel execution time: {execution_time:.6f} seconds")

        memory_manager.copy_to_host(d_C, C)

        return C

    finally:
        context_manager.finalize_context()

if __name__ == "__main__":
    rows, cols = 1024, 1024
    result = matrix_addition_with_profiling(rows, cols)
    print(f"Matrix addition result (first 5x5 elements):\n{result[:5, :5]}")
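
For a kernel that indexes in two dimensions, as `matrix_add` does, the grid has to cover columns along x and rows along y. The sketch below shows one way to size such a launch by hand in plain Python; `grid_2d` is a hypothetical helper, not a pycuda_plus function, and how `GridBlockConfig.auto_config` sizes its result is not assumed here.

```python
def grid_2d(rows, cols, block_x, block_y):
    """Return (grid, block) tuples that cover a rows x cols matrix."""
    grid_x = (cols + block_x - 1) // block_x  # x dimension spans columns
    grid_y = (rows + block_y - 1) // block_y  # y dimension spans rows
    return (grid_x, grid_y, 1), (block_x, block_y, 1)


# A 16 x 16 block holds 256 threads.
grid, block = grid_2d(1024, 1024, 16, 16)
print(grid, block)

# A non-square matrix with a 1D block of 256 threads along x.
print(grid_2d(1000, 300, 256, 1))
```

Sizing each axis separately keeps the number of idle threads small, which matters when the launch is being profiled.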

Example 4: Elementwise Multiplication

import numpy as np
from pycuda_plus.core.kernel import KernelExecutor
from pycuda_plus.core.memory import MemoryManager
from pycuda_plus.utils.numpy_support import NumpyHelper
from pycuda_plus.core.error import CudaErrorChecker
from pycuda_plus.core.context import CudaContextManager

# Custom CUDA kernel for elementwise multiplication
kernel_code = """
__global__ void multiply_arrays(float *a, float *b, float *c, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        c[idx] = a[idx] * b[idx];
    }
}
"""

def example_using_numpy_helper(N):
    # Instantiate required components
    kernel_executor = KernelExecutor()
    memory_manager = MemoryManager()
    numpy_helper = NumpyHelper()  # Used below for the patterned array and elementwise addition
    cuda_error_checker = CudaErrorChecker()
    context_manager = CudaContextManager()

    # Initialize the CUDA context
    context_manager.initialize_context()

    try:
        # Host arrays
        host_array1 = np.random.rand(N).astype(np.float32)
        host_array2 = np.random.rand(N).astype(np.float32)
        host_result = np.zeros(N, dtype=np.float32)

        # Compile kernel
        multiply_kernel = kernel_executor.compile_kernel(kernel_code, 'multiply_arrays')

        # Allocate device memory using MemoryManager (from pycuda_plus)
        d_array1 = memory_manager.allocate_device_array(host_array1.shape, dtype=np.float32)
        d_array2 = memory_manager.allocate_device_array(host_array2.shape, dtype=np.float32)
        d_result = memory_manager.allocate_device_array(host_result.shape, dtype=np.float32)

        # Copy host data to device memory
        memory_manager.copy_to_device(host_array1, d_array1)
        memory_manager.copy_to_device(host_array2, d_array2)

        # Launch kernel
        block_size = 256
        grid_size = (N + block_size - 1) // block_size
        kernel_executor.launch_kernel(
            multiply_kernel,
            (grid_size, 1, 1),
            (block_size, 1, 1),
            d_array1, d_array2, d_result, np.int32(N)
        )

        # Synchronize and check for errors
        cuda_error_checker.check_errors()

        # Copy result back to host
        memory_manager.copy_to_host(d_result, host_result)

        # Generate a patterned device array using NumpyHelper
        d_patterned_array = numpy_helper.generate_patterned_array((N,), 'range')
        patterned_array = numpy_helper.batch_copy_to_host([d_patterned_array])[0]

        # Perform an elementwise operation using NumpyHelper
        d_elementwise_result = numpy_helper.elementwise_operation(d_array1, d_array2, 'add')
        elementwise_result = numpy_helper.batch_copy_to_host([d_elementwise_result])[0]

        # Return all results
        return {
            "elementwise_multiplication": host_result,
            "patterned_array": patterned_array[:10],  # Show first 10 elements
            "elementwise_addition": elementwise_result[:10],  # Show first 10 elements
        }

    finally:
        # Finalize CUDA context
        context_manager.finalize_context()

if __name__ == "__main__":
    N = 10000  # Array size
    results = example_using_numpy_helper(N)

    # Print results
    print("Elementwise multiplication (first 10 elements):", results["elementwise_multiplication"][:10])
    print("Patterned array (first 10 elements):", results["patterned_array"])
    print("Elementwise addition (first 10 elements):", results["elementwise_addition"])
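
When comparing float32 GPU results with a CPU reference computed in double precision, exact equality usually fails and a tolerance is needed. The sketch below illustrates this in plain Python by emulating float32 rounding with `struct`; `to_float32` and `all_close` are hypothetical helpers, and the tolerances are illustrative choices rather than values taken from pycuda_plus.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def all_close(actual, expected, rel_tol=1e-5, abs_tol=1e-7):
    """True when both sequences have equal length and match within tolerance."""
    return len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


a = [0.1, 0.25, 0.7]
b = [0.3, 0.5, 0.9]

# Emulate a float32 kernel: float32 inputs, float32 result.
gpu_like = [to_float32(to_float32(x) * to_float32(y)) for x, y in zip(a, b)]
reference = [x * y for x, y in zip(a, b)]

print("exact match:", gpu_like == reference)
print("within tolerance:", all_close(gpu_like, reference))
```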

API Documentation

Core Modules

KernelExecutor

Compile and launch CUDA kernels.

Example:

kernel_executor = KernelExecutor()
compiled_kernel = kernel_executor.compile_kernel(kernel_code, kernel_name)
kernel_executor.launch_kernel(compiled_kernel, grid, block, *args)

MemoryManager

Allocate, manage, and transfer memory between host and device.

Example:

memory_manager = MemoryManager()
device_array = memory_manager.allocate_device_array(shape, dtype)
memory_manager.copy_to_device(host_array, device_array)
memory_manager.copy_to_host(device_array, host_array)
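
Before allocating, it can be worth estimating how much device memory a set of arrays will need. The following is a rough sketch in plain Python; `device_bytes` is a hypothetical helper, and the estimate ignores any padding or bookkeeping overhead the allocator may add.

```python
import math


def device_bytes(shape, itemsize, n_arrays=1):
    """Bytes needed for n_arrays dense arrays of the given shape and element size."""
    return math.prod(shape) * itemsize * n_arrays


FLOAT32_SIZE = 4  # bytes per np.float32 element

# Three 512 x 512 float32 matrices (A, B, and C in the matrix multiplication example).
total = device_bytes((512, 512), FLOAT32_SIZE, n_arrays=3)
print(f"{total} bytes = {total / 2**20:.1f} MiB")
```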

CudaContextManager

Simplify CUDA context setup and teardown.

Example:

context_manager = CudaContextManager()
context_manager.initialize_context()
context_manager.finalize_context()
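
The examples above pair `initialize_context()` with `finalize_context()` in a try/finally block. One way to package that pattern is a small wrapper built on `contextlib`, sketched below. `cuda_context` is a hypothetical helper, not part of pycuda_plus, and `RecordingContextManager` is a stand-in for `CudaContextManager` so the sketch runs without a GPU; only the two method names come from the documentation above.

```python
from contextlib import contextmanager


@contextmanager
def cuda_context(manager):
    """Initialize the context on entry and always finalize it on exit."""
    manager.initialize_context()
    try:
        yield manager
    finally:
        manager.finalize_context()


class RecordingContextManager:
    """Stand-in for CudaContextManager that records which methods were called."""

    def __init__(self):
        self.calls = []

    def initialize_context(self):
        self.calls.append("initialize")

    def finalize_context(self):
        self.calls.append("finalize")


stub = RecordingContextManager()
try:
    with cuda_context(stub):
        raise RuntimeError("simulated kernel failure")
except RuntimeError:
    pass

# finalize_context ran even though the body raised.
print(stub.calls)
```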

CudaErrorChecker

Check for CUDA errors during kernel execution.

Example:

error_checker = CudaErrorChecker()
error_checker.check_errors()

GridBlockConfig

Automate grid and block size calculation.

Example:

grid_config = GridBlockConfig(threads_per_block=256)
grid, block = grid_config.auto_config(shape)
print(f"Grid: {grid}, Block: {block}")

PerformanceProfiler

Measure execution time of CUDA kernels.

Example:

profiler = PerformanceProfiler()
execution_time = profiler.profile_kernel(kernel, grid, block, *args)
print(f"Kernel execution time: {execution_time:.6f} seconds")
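
How `profile_kernel` measures time internally is not documented here. As a point of comparison, a generic host-side timer can be sketched with the standard library, as below. `time_callable` is a hypothetical helper; because GPU kernel launches are typically asynchronous, the callable passed in would need to block until the work is finished for the measurement to be meaningful.

```python
import time


def time_callable(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# CPU-only demonstration: time summing a range of integers.
elapsed = time_callable(sum, range(100_000))
print(f"Best of 5 runs: {elapsed:.6f} seconds")
```

Taking the best of several runs reduces the effect of one-off costs such as warm-up on the first call.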

NumpyHelper

Provide advanced utilities for integrating NumPy arrays with CUDA device memory using pycuda_plus.

Functions:
- reshape_device_array(device_array, new_shape): Reshape a device array into a new shape without changing its contents.
- elementwise_operation(device_array1, device_array2, operation): Perform element-wise operations like addition, subtraction, multiplication, or division on device arrays.
- generate_patterned_array(shape, pattern): Generate patterned device arrays (e.g., range, linspace) for device-side operations.
- batch_copy_to_device(numpy_arrays): Batch copy multiple NumPy arrays to device memory.
- batch_copy_to_host(device_arrays): Batch copy multiple device arrays to host memory.

Example:

numpy_helper = NumpyHelper()
d_patterned_array = numpy_helper.generate_patterned_array((1000,), 'range')
d_array1, d_array2 = numpy_helper.batch_copy_to_device([array1, array2])
d_result = numpy_helper.elementwise_operation(d_array1, d_array2, 'add')
result = numpy_helper.batch_copy_to_host([d_result])[0]

Utility Modules

numpy_support: Convert between NumPy arrays and GPU memory.
prebuilt_kernels: Access commonly used CUDA kernels.
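grid_block: Helpers for calculating grid and block dimensions (the examples above import GridBlockConfig from pycuda_plus.core.grid_block).
profiler: Tools for profiling CUDA kernel execution (the examples above import PerformanceProfiler from pycuda_plus.core.profiler).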

Contributing

Contributions are welcome! Please open issues or submit pull requests on the GitHub repository.

License

PyCUDA Plus is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Built on the foundation of PyCUDA, with additional utilities for enhanced usability and performance.

Release history

- 0.3.1: Jan 30, 2025
- 0.3.0: Jan 27, 2025
- 0.2.9: Jan 27, 2025
- 0.2.8: Jan 27, 2025
- 0.2.7: Jan 27, 2025
- 0.2.6: Jan 27, 2025
- 0.2.5: Jan 25, 2025
- 0.2.4: Jan 24, 2025
- 0.2.3: Jan 24, 2025
- 0.2.2: Jan 22, 2025
- 0.2.1: Jan 22, 2025
- 0.2.0: Jan 22, 2025
- 0.1.9: Jan 19, 2025
- 0.1.8: Jan 19, 2025
- 0.1.7: Jan 19, 2025
- 0.1.6: Jan 19, 2025 (this version)
- 0.1.5: Jan 19, 2025
- 0.1.4: Jan 19, 2025
- 0.1.3: Jan 19, 2025
- 0.1.2: Jan 15, 2025
- 0.1.1: Jan 15, 2025
- 0.1.0: Jan 14, 2025

Download files

Source Distribution

pycuda_plus-0.1.6.tar.gz (15.9 kB)

Uploaded Jan 19, 2025 Source

Built Distribution

pycuda_plus-0.1.6-py3-none-any.whl (18.1 kB)

Uploaded Jan 19, 2025 Python 3

File details

Details for the file pycuda_plus-0.1.6.tar.gz.

File metadata

Download URL: pycuda_plus-0.1.6.tar.gz
Upload date: Jan 19, 2025
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.8.20

File hashes

Hashes for pycuda_plus-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`b8ed5c983e79e658f1d2ddb1d4bd90629e232eb5ddd878cbfbd628e3dca33515`
MD5	`53c7b54640bd7dc39f423770565ce73c`
BLAKE2b-256	`357b791d73f3800e6c6424fa404479c0693ba52357001903d2e01762f641d3b2`

File details

Details for the file pycuda_plus-0.1.6-py3-none-any.whl.

File metadata

Download URL: pycuda_plus-0.1.6-py3-none-any.whl
Upload date: Jan 19, 2025
Size: 18.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.8.20

File hashes

Hashes for pycuda_plus-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`902ccd6e938e3c65058967ec8a2f605a695b7dd8c47a300ee6cf1d939ea50f96`
MD5	`9bec475418c897f269acec91837d8a5b`
BLAKE2b-256	`c4c57c3233d3c39eae0e69f6f9444722d399f9ae6857a05a2924160545bae7e0`
