
PyCUDA based PyTorch Extension Made Easy



In a word, cutex bridges PyCUDA's just-in-time compilation with PyTorch's Tensor type.

cutex.SourceModule extends PyCUDA's SourceModule in the following ways:

  • Designed to work seamlessly with the PyTorch Tensor type, DistributedDataParallel (DDP), and the autograd.Function API.
  • Supports efficient multi-dimensional torch.Tensor access with an (efficient and optional) out-of-bounds check.
  • Enhanced automatic type conversion and error messages.

cutex.SourceModule differs from PyTorch's official CUDA extension guide in the following ways:

  • It compiles lightning fast, which makes it especially suitable for rapidly developing your favorite new algorithm.
  • There are no boilerplate C++ wrappers; all user code lives in a single Python file.
  • It uses raw CUDA syntax, so PyTorch's C++ API is not available; it is recommended to stick to either raw CUDA with cutex or the Python API with PyTorch.

Example (inline CUDA API)

This high-level API for writing custom kernels was introduced in v0.3.0. You omit the kernel function's signature, and cutex.inline() compiles and runs the code according to the surrounding context on the CUDA device. This facilitates fluent switching between PyTorch and PyCUDA.

import cutex
import torch
from torch import Tensor


def matmul(A: Tensor, B: Tensor) -> Tensor:
    M, J = A.size()
    K, N = B.size()
    assert J == K
    gridDim = (cutex.ceildiv(N, 16), cutex.ceildiv(M, 16), 1)
    blockDim = (16, 16, 1)
    C = torch.empty((M, N), dtype=A.dtype, device=A.device)
    cutex.inline("""
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.f;
    if (m >= M || n >= N) return;
    for (int k = 0; k < K; ++k) {
        v += A[m][k] * B[k][n];
    }
    C[m][n] = v;
    """, boundscheck=False)  # all local variables are captured into the kernel, except those with unknown types
    return C


def test():
    M, N, K = 4, 4, 1
    A = torch.rand((M, K), dtype=torch.float32).cuda()
    B = torch.rand((K, N), dtype=torch.float32).cuda()
    torch.testing.assert_close(matmul(A, B), torch.mm(A, B))
    print(matmul(A, B)) 


test()
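The grid size above is computed with `cutex.ceildiv`, the `ceildiv(int, int) -> int` utility listed in the change log. It is plain ceiling division; a minimal pure-Python equivalent (shown here only to make the launch geometry concrete, not the library's actual implementation) is:

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division: the number of size-b blocks needed to cover a elements."""
    return -(-a // b)  # same as math.ceil(a / b), but stays in integer arithmetic

# A 100x40 output with 16x16 blocks needs a 3x7 grid of blocks
# (x covers the 40 columns, y covers the 100 rows).
gridDim = (ceildiv(40, 16), ceildiv(100, 16), 1)  # (3, 7, 1)
```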

Local variables of Tensor type and of common scalar types (int, float, etc.), plus the special variables gridDim and blockDim, are captured into the inline execution as if it ran in the same scope. The order in which they are defined does not matter; they only have to be assigned before the inline execution. Multiple inline executions in the same Python function are legal; when doing so, update gridDim and blockDim before each subsequent execution.

The tensors can be accessed element-wise using multi-dimensional square brackets [], as illustrated in the example above. Elements can be both read and written, and modifications are reflected directly in the PyTorch tensor on the CUDA device. By default, with the boundscheck option on, these brackets check for out-of-bounds access. While this is very useful for debugging novel algorithms, it consumes extra registers in the SM, so if you want to make full use of the SM's register resources, e.g. when using the maximum number of threads per block, turn boundscheck off for best performance.
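As a rough pure-Python analogy (not cutex code), a bounds-checked accessor behaves like a wrapper that validates every index before touching memory; the extra comparisons are what cost registers on the device:

```python
class CheckedView:
    """Sketch of what a bounds-checked 2-D tensor accessor does.
    The real check runs on the GPU inside operator[]; this is only an analogy."""

    def __init__(self, data, shape):
        self.data, self.shape = data, shape

    def __getitem__(self, idx):
        row, col = idx
        # The per-access validation that boundscheck=True performs on the device.
        if not (0 <= row < self.shape[0] and 0 <= col < self.shape[1]):
            raise IndexError(f"index {idx} out of bounds for shape {self.shape}")
        return self.data[row * self.shape[1] + col]  # row-major flat offset

v = CheckedView(list(range(6)), (2, 3))
v[1, 2]    # in-bounds read
# v[2, 0]  # would raise IndexError, like a boundscheck failure on the device
```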

Unless explicitly specified otherwise, the float type in the CUDA source is automatically replaced with the type of the first local Tensor variable that has a float dtype; in the example above, it is aligned with A.
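The substitution can be pictured as a simple dtype-to-C-type lookup keyed on that first float tensor's dtype. The `c_float_type` helper below is hypothetical, written only to illustrate the rule (the `__half` mapping is inferred from the half-precision support mentioned in the change log, not from cutex internals):

```python
# Hypothetical sketch of the `float` substitution rule described above.
_C_TYPE = {"float16": "__half", "float32": "float", "float64": "double"}

def c_float_type(dtype_name: str) -> str:
    """Map the dtype of the first float tensor to the CUDA type that
    stands in for the literal `float` in the inlined source."""
    return _C_TYPE[dtype_name]

c_float_type("float32")  # in the example above, A is float32, so `float` stays `float`
c_float_type("float16")  # with half-precision inputs, `float` would become `__half`
```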

Example (lower level SourceModule API)

The following example demonstrates a vanilla matrix multiplication for PyTorch tensors, written in pure CUDA. As you may happily notice, PyTorch is responsible for allocating new tensors (rather than doing so in the CUDA code), and tensor elements can be read and modified inside the kernel function.

import torch
import cutex

M, N, K = 4, 4, 1
a = torch.rand((M, K), dtype=torch.float32).cuda()
b = torch.rand((K, N), dtype=torch.float32).cuda()
c = torch.empty((M, N), dtype=torch.float32).cuda()

kernels = cutex.SourceModule("""
//cuda
__global__ void matmul(Tensor<float, 2> a, Tensor<float, 2> b, Tensor<float, 2> c, int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.f;
    if (m >= M || n >= N) return; // you can also write `a.size(0)` instead of `M`, `b.size(1)` instead of `N`
    for (int k = 0; k < K; ++k) { // you can also write `a.size(1)` instead of `K`
        v += a[m][k] * b[k][n]; // tensor elements are accessed like a multi-level array, with an optional out-of-bounds check.
    }
    c[m][n] = v; // the modification is reflected in the torch tensor in place; no redundant data copying.
}
//!cuda
""",
    float_bits=32,    # change to 16 to use half precision as the `float` type in the source above.
    boundscheck=True, # turn on for debugging, off for performance (to use a block's full thread count); default is on.
    )

kernels.matmul(  # automatically discover the kernel function by its name (e.g. 'matmul'), just like a normal python module.
    a, b, c, M, N, K,  # directly pass tensors and scalars as arguments
    grid=(N // 16 + 1, M // 16 + 1),  # grid size (number of blocks to be executed)
    block=(16, 16, 1),  # block size (number of threads in each block)
)

assert torch.allclose(c, torch.mm(a, b))
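One detail worth noting: the grid size `N // 16 + 1` rounds up, but launches one extra, fully idle block whenever the size is an exact multiple of 16. The guard `if (m >= M || n >= N) return;` makes this harmless, while `cutex.ceildiv` (sketched here as plain ceiling division) avoids the extra block entirely:

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division, as provided by cutex.ceildiv."""
    return -(-a // b)

# For N = 32 with 16-wide blocks:
blocks_naive = 32 // 16 + 1    # 3 blocks: one of them has no work to do
blocks_exact = ceildiv(32, 16) # 2 blocks: exact cover

# For N = 33 the two formulas agree:
assert 33 // 16 + 1 == ceildiv(33, 16) == 3
```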

Installation

pip install -U cutex --index-url "https://pypi.org/simple/"

Note:

  • You should install PyTorch and nvcc manually; they are not automatically managed dependencies.
  • The //cuda and //!cuda comments are not mandatory; they work together with the VSCode extension for highlighting CUDA source in Python docstrings.

Change Log

# format: {pypi-version}+{git-commit-hash} - ["[CUDA]"] {description}
# "[CUDA]" means changes related to the cuda side Tensor API.

v0.3.8+HEAD - add boundscheck option to inline execution
v0.3.7+e48537 - bugfix: passing a Python float to a kernel that accepts a __half type now works.
v0.3.6+4e9b41 - bugfix: v0.3.5 used a regex to replace bool, which could be confused with Tensors of bool dtype; this version reverts v0.3.5 and uses the wrapper to convert the scalar type.
v0.3.5+8bdfbc - bugfix: bool scalar type is automatically converted to int32_t.
v0.3.4+07b6af - bugfix: error report in Jupyter cells.
v0.3.3+0dc015 - bugfix: error report now searches the whole file.
v0.3.2+bc47ee - enhanced the error report, accurate lineno in the python file; ensure gridDim and blockDim to be integers.
v0.3.1+b46561 - automatically send tensor to cuda in inline execution; scalars are const;
v0.3.0+b93dc6 - !NEW FEATURE! inline execution of CUDA code
v0.2.2+025fb1 - multiple enhancements.
    - [CUDA] fatal bug fixed checking OOB in `Tensor<Any,1>.size(dim:int)->int` function;
    - !NEW FEATURE! add `ceildiv(int, int)->int` API as a util function.
v0.2.1+dc4373 - [CUDA] add `Tensor.size(dim:int)->int` API.
v0.2.0+03c3c5 - [CUDA] !NEW FEATURE! declare Tensor type argument instead of its pointer.
v0.1.1+d088de - core features
    - basic automatic cuda context management;
    - basic automatic tensor type argument via `pycuda.driver.PointerHolderBase`;
    - basic out-of-boundary check;
    - easy to use `SourceModule` API.


