
PyCUDA based PyTorch Extension Made Easy



In a word, cutex bridges PyCUDA's just-in-time compilation with PyTorch's Tensor type.

cutex.SourceModule extends PyCUDA's SourceModule in the following ways:

  • Designed to work seamlessly with the PyTorch Tensor type, DistributedDataParallel (DDP), and the autograd.Function API.
  • Supports efficient multi-dimensional torch.Tensor access with an (efficient and optional) out-of-bounds check.
  • Enhanced automatic type conversion and error messages.

cutex.SourceModule differs from PyTorch's official CUDA extension guide in the following ways:

  • It compiles lightning fast, which makes it especially suitable for rapidly developing your favorite new algorithm.
  • There are no boilerplate C++ wrappers; all user code lives in a single Python file.
  • It uses raw CUDA syntax, so PyTorch's C++ API is not available; it is recommended to stick to either raw CUDA with cutex or the Python API with PyTorch.

Example (inline CUDA API)

This high-level API for writing custom kernels was introduced in v0.3.0. You omit the kernel function's signature, and cutex.inline() compiles and runs the code according to the surrounding context on the CUDA device. This facilitates fluent switching between PyTorch and PyCUDA.

import cutex
import torch
from torch import Tensor


def matmul(A: Tensor, B: Tensor) -> Tensor:
    M, J = A.size()
    K, N = B.size()
    assert J == K
    gridDim = (cutex.ceildiv(N, 16), cutex.ceildiv(M, 16), 1)
    blockDim = (16, 16, 1)
    C = torch.empty((M, N), dtype=A.dtype, device=A.device)
    cutex.inline("""
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.f;
    if (m >= M || n >= N) return;
    for (int k = 0; k < K; ++k) {
        v += A[m][k] * B[k][n];
    }
    C[m][n] = v;
    """, boundscheck=False)  # all local variables are captured into the kernel, except those with unknown types
    return C


def test():
    M, N, K = 4, 4, 1
    A = torch.rand((M, K), dtype=torch.float32).cuda()
    B = torch.rand((K, N), dtype=torch.float32).cuda()
    torch.testing.assert_close(matmul(A, B), torch.mm(A, B))
    print(matmul(A, B)) 


test()
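The grid size above is computed with `cutex.ceildiv`, the `ceildiv(int, int) -> int` utility listed in the change log. It is plain ceiling division; a minimal pure-Python equivalent (shown here only to make the launch geometry concrete, not the library's actual implementation) is:

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division: the number of size-b blocks needed to cover a elements."""
    return -(-a // b)  # same as math.ceil(a / b), but stays in integer arithmetic

# A 100x40 output with 16x16 blocks needs a 3x7 grid of blocks
# (x covers the 40 columns, y covers the 100 rows).
gridDim = (ceildiv(40, 16), ceildiv(100, 16), 1)  # (3, 7, 1)
```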

Local variables of Tensor type and of common scalar types (int, float, etc.), plus the special variables gridDim and blockDim, are captured into the inline execution as if it ran in the same scope. The order in which they are defined does not matter; they only have to be assigned before the inline execution. Multiple inline executions in the same Python function are legal; when doing so, update gridDim and blockDim before each subsequent execution.

The tensors can be accessed element-wise using multi-dimensional square brackets [], as illustrated in the example above. Elements can be both read and written, and modifications are reflected directly in the PyTorch tensor on the CUDA device. By default, with the boundscheck option on, these brackets check for out-of-bounds access. While this is very useful for debugging novel algorithms, it consumes extra registers in the SM, so if you want to make full use of the SM's register resources, e.g. when using the maximum number of threads per block, turn boundscheck off for best performance.
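As a rough pure-Python analogy (not cutex code), a bounds-checked accessor behaves like a wrapper that validates every index before touching memory; the extra comparisons are what cost registers on the device:

```python
class CheckedView:
    """Sketch of what a bounds-checked 2-D tensor accessor does.
    The real check runs on the GPU inside operator[]; this is only an analogy."""

    def __init__(self, data, shape):
        self.data, self.shape = data, shape

    def __getitem__(self, idx):
        row, col = idx
        # The per-access validation that boundscheck=True performs on the device.
        if not (0 <= row < self.shape[0] and 0 <= col < self.shape[1]):
            raise IndexError(f"index {idx} out of bounds for shape {self.shape}")
        return self.data[row * self.shape[1] + col]  # row-major flat offset

v = CheckedView(list(range(6)), (2, 3))
v[1, 2]    # in-bounds read
# v[2, 0]  # would raise IndexError, like a boundscheck failure on the device
```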

Unless explicitly specified otherwise, the float type in the CUDA source is automatically replaced with the type of the first local Tensor variable that has a float dtype; in the example above, it is aligned with A.
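The substitution can be pictured as a simple dtype-to-C-type lookup keyed on that first float tensor's dtype. The `c_float_type` helper below is hypothetical, written only to illustrate the rule (the `__half` mapping is inferred from the half-precision support mentioned in the change log, not from cutex internals):

```python
# Hypothetical sketch of the `float` substitution rule described above.
_C_TYPE = {"float16": "__half", "float32": "float", "float64": "double"}

def c_float_type(dtype_name: str) -> str:
    """Map the dtype of the first float tensor to the CUDA type that
    stands in for the literal `float` in the inlined source."""
    return _C_TYPE[dtype_name]

c_float_type("float32")  # in the example above, A is float32, so `float` stays `float`
c_float_type("float16")  # with half-precision inputs, `float` would become `__half`
```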

Example (lower level SourceModule API)

The following example demonstrates a vanilla matrix multiplication for PyTorch tensors, written in pure CUDA. As you may happily notice, PyTorch is responsible for allocating new tensors (rather than doing so in the CUDA code), and tensor elements can be read and modified inside the kernel function.

import torch
import cutex

M, N, K = 4, 4, 1
a = torch.rand((M, K), dtype=torch.float32).cuda()
b = torch.rand((K, N), dtype=torch.float32).cuda()
c = torch.empty((M, N), dtype=torch.float32).cuda()

kernels = cutex.SourceModule("""
//cuda
__global__ void matmul(Tensor<float, 2> a, Tensor<float, 2> b, Tensor<float, 2> c, int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.f;
    if (m >= M || n >= N) return; // you can also write `a.size(0)` instead of `M`, `b.size(1)` instead of `N`
    for (int k = 0; k < K; ++k) { // you can also write `a.size(1)` instead of `K`
        v += a[m][k] * b[k][n]; // tensor elements are accessed like a multi-level array, with an optional out-of-bounds check.
    }
    c[m][n] = v; // the modification is reflected in the torch tensor in place; no redundant data copying.
}
//!cuda
""",
    float_bits=32,    # change to 16 to use half precision as the `float` type in the source above.
    boundscheck=True, # turn on for debugging, off for performance (to use a block's full thread count); default is on.
    )

kernels.matmul(  # automatically discover the kernel function by its name (e.g. 'matmul'), just like a normal python module.
    a, b, c, M, N, K,  # directly pass tensors and scalars as arguments
    grid=(N // 16 + 1, M // 16 + 1),  # grid size (number of blocks to be executed)
    block=(16, 16, 1),  # block size (number of threads in each block)
)

assert torch.allclose(c, torch.mm(a, b))
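One detail worth noting: the grid size `N // 16 + 1` rounds up, but launches one extra, fully idle block whenever the size is an exact multiple of 16. The guard `if (m >= M || n >= N) return;` makes this harmless, while `cutex.ceildiv` (sketched here as plain ceiling division) avoids the extra block entirely:

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division, as provided by cutex.ceildiv."""
    return -(-a // b)

# For N = 32 with 16-wide blocks:
blocks_naive = 32 // 16 + 1    # 3 blocks: one of them has no work to do
blocks_exact = ceildiv(32, 16) # 2 blocks: exact cover

# For N = 33 the two formulas agree:
assert 33 // 16 + 1 == ceildiv(33, 16) == 3
```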

Installation

pip install -U cutex --index-url "https://pypi.org/simple/"

Note:

  • You should install PyTorch and nvcc manually; they are not automatically managed dependencies.
  • The //cuda and //!cuda comments are not mandatory; they work together with the VSCode extension for highlighting CUDA source in Python docstrings.

Change Log

# format: {pypi-version}+{git-commit-hash} - ["[CUDA]"] {description}
# "[CUDA]" means changes related to the cuda side Tensor API.

v0.3.8+HEAD - add boundscheck option to inline execution
v0.3.7+e48537 - bugfix: passing a Python float to a kernel that accepts a __half type now works.
v0.3.6+4e9b41 - bugfix: v0.3.5 used a regex to replace bool, which could be confused with Tensors of bool dtype; this version reverts v0.3.5 and uses the wrapper to convert the scalar type.
v0.3.5+8bdfbc - bugfix: bool scalar type is automatically converted to int32_t.
v0.3.4+07b6af - bugfix: error report in Jupyter cells.
v0.3.3+0dc015 - bugfix: error report now searches the whole file.
v0.3.2+bc47ee - enhanced the error report, accurate lineno in the python file; ensure gridDim and blockDim to be integers.
v0.3.1+b46561 - automatically send tensor to cuda in inline execution; scalars are const;
v0.3.0+b93dc6 - !NEW FEATURE! inline execution of CUDA code
v0.2.2+025fb1 - multiple enhancements.
    - [CUDA] fatal bug fixed checking OOB in `Tensor<Any,1>.size(dim:int)->int` function;
    - !NEW FEATURE! add `ceildiv(int, int)->int` API as a util function.
v0.2.1+dc4373 - [CUDA] add `Tensor.size(dim:int)->int` API.
v0.2.0+03c3c5 - [CUDA] !NEW FEATURE! declare Tensor type argument instead of its pointer.
v0.1.1+d088de - core features
    - basic automatic cuda context management;
    - basic automatic tensor type argument via `pycuda.driver.PointerHolderBase`;
    - basic out-of-boundary check;
    - easy to use `SourceModule` API.


