Pytorch routines for (Ker)nel (Mac)hines

KERMAC

Kermac is a collection of fused CUDA kernels for fast, memory-efficient computation in kernel methods. Kermac makes heavy use of JIT (just-in-time) compilation to generate custom CUDA kernels on demand. Compiled kernels are stored in a cache database, so JIT costs are only incurred once. Passing debug=True to most kermac routines prints information about the compilation and caching of these JIT CUDA kernels.
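
For example, a first call with debug=True reports the JIT compile, while a second call with the same configuration is served from the cache. This is a minimal sketch: the shapes follow the a[K,M], b[K,N] layout described later on this page, and the p keyword is an assumption here.

import torch
import kermac

# Hypothetical sizes; layout per the C-style description below.
a = torch.randn(64, 1000, device='cuda')
b = torch.randn(64, 2000, device='cuda')

# First call JIT-compiles the kernel and prints compile/cache info.
out = kermac.cdist(a, b, p=1.0, debug=True)

# Second call with the same configuration hits the cache database.
out = kermac.cdist(a, b, p=1.0, debug=True)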

Kermac supports only NVIDIA cards with compute capability sm_80 or higher. This includes:

  • Server cards like A10, A100, H100, B100
  • Consumer cards like RTX 30xx, RTX 40xx, RTX 50xx

Kermac relies on cuda-core for JIT compilation, which supports CUDA toolkits 11.8 and 12.x. Thanks to the cuda-core and nvmath packages, no C++ compilation or wheel-building system is needed to install this library.

Installation

CUDA 12

pip install kermac[cu12]

CUDA 11

pip install kermac[cu11]

Linalg

The linalg functionality depends on nvmath-python, which is not a required dependency. To run linalg routines, install one of:

pip install nvmath-python[cu12]

or

pip install nvmath-python[cu11]

Examples

From a fresh environment you can do:

cdist.py

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist.py
python cdist.py -d -p 1.0

cdist_grad.py

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist_grad.py
python cdist_grad.py -d

build_a_kernel.py

Running build_a_kernel.py will batch compile quite a few different kernels on first run. Expect around 20 seconds of JIT compiling.

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/build_a_kernel.py
python build_a_kernel.py -d

linalg.py

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/linalg.py
python linalg.py

Function: linalg.solve_cholesky

Solves a symmetric positive-definite system of equations via Cholesky factorization (the combination of torch.linalg.cholesky and torch.cholesky_solve). Wraps xpotrf and xpotrs from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed Cholesky factorization. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, each requiring a small workspace allocation, and synchronizes these streams against the current stream at the end of the routine.
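
A minimal sketch of a batched solve. The call signature is an assumption based on the description above (a torch.linalg-style a @ x = b); whether the factor overwrites a depends on the in-place option.

import torch
import kermac

B, N, K = 4, 1024, 8                                           # hypothetical sizes
m = torch.randn(B, N, N, device='cuda')
a = m @ m.transpose(-1, -2) + N * torch.eye(N, device='cuda')  # batch of SPD matrices
b = torch.randn(B, N, K, device='cuda')

a_ref = a.clone()                                # keep a copy, since the factor
x = kermac.linalg.solve_cholesky(a, b)           # may be written into a in-place
print(torch.allclose(a_ref @ x, b, atol=1e-3))   # check the solve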

Function: linalg.solve_lu

Solves a general square system of equations like torch.linalg.solve. Wraps xgetrf and xgetrs from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed LU decomposition. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, each requiring a small workspace allocation, and synchronizes these streams against the current stream at the end of the routine.
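
Usage mirrors solve_cholesky, but for general (non-SPD) square systems; again, the signature is an assumption based on the description above.

import torch
import kermac

a = torch.randn(4, 1024, 1024, device='cuda')    # batch of general square systems
b = torch.randn(4, 1024, 8, device='cuda')
x = kermac.linalg.solve_lu(a, b)                 # factor may be written into a in-place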

Function: linalg.eigh

Computes eigenvalues and eigenvectors of a symmetric matrix like torch.linalg.eigh. Wraps xsyevd from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed eigenvalue decomposition. Additionally, this routine can write the eigenvectors in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, each requiring a small workspace allocation, and synchronizes these streams against the current stream at the end of the routine.
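
A sketch assuming the call mirrors torch.linalg.eigh; the input is cloned because the routine can overwrite it with the eigenvectors.

import torch
import kermac

m = torch.randn(4, 512, 512, device='cuda')
a = 0.5 * (m + m.transpose(-1, -2))              # symmetrize
w, v = kermac.linalg.eigh(a.clone())             # clone: a may be overwritten in-place
print(torch.allclose(w, torch.linalg.eigh(a).eigenvalues, atol=1e-3))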

Function: cdist

An implementation of torch.cdist that computes fractional norms and supports batches and broadcasting. Aside from the out tensor in the out=None case, it does not allocate.

Computes:

$out_{n,m} = \left( \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p \right)^{\frac{1}{p}}$

If skip_epilogue is set, it instead computes:

$out_{n,m} = \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p$

Or, expressed in C-style pseudocode, it efficiently computes:

// a[K,M], b[K,N], out[N,M]; out is assumed zero-initialized
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int k = 0; k < K; k++) {
            out[n,m] += pow(abs(b[k,n] - a[k,m]), p);
        }
        if (!skip_epilogue) {
            out[n,m] = pow(out[n,m], 1.0/p);
        }
    }
}

It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.
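
Since the layout above puts the contracted mode K first, a kermac.cdist result should match torch.cdist on the transposed inputs. A sketch; the argument order and p keyword are assumptions.

import torch
import kermac

K, M, N = 64, 1000, 2000
a = torch.randn(K, M, device='cuda')             # a[K, M]
b = torch.randn(K, N, device='cuda')             # b[K, N]

out = kermac.cdist(a, b, p=1.3)                  # out[N, M]

# torch.cdist expects points along rows, so compare against transposed inputs.
ref = torch.cdist(b.T.contiguous(), a.T.contiguous(), p=1.3)
print(torch.allclose(out, ref, atol=1e-4))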

kermac.cdist vs torch.cdist

With problem size $[M,N,K] = [30000, 30000, 1024]$:

GPU · p-norm          Speed-up (×)   kermac.cdist (ms)   torch.cdist (ms)
GH200 · p = 1.0       29.1×          82                  2,389
GH200 · p = 1.3       9.6×           453                 4,360
GH200 · p = 2.0       5.2×           79                  406
H100-PCIe · p = 1.0   27.0×          108                 2,907
H100-PCIe · p = 1.3   9.4×           592                 5,591
H100-PCIe · p = 2.0   3.3×           104                 346
A100 · p = 1.0        15.4×          251                 3,878
A100 · p = 1.3        9.4×           873                 8,230
A100 · p = 2.0        0.9×           325                 301
RTX 4090 · p = 1.0    52.6×          76                  4,021
RTX 4090 · p = 1.3    11.8×          350                 4,141
RTX 4090 · p = 2.0    3.4×           77                  262

Function: run_kernel

This is a more customizable version of kermac.cdist; kermac.cdist is written on top of it. run_kernel accepts a descriptor argument that can create fully fused kernel functions. You can specify the inner-norm type (abs(x), x*x, or pow(x,p)), the outer-norm type (x, sqrt(x), or pow(x,1/p)), and finally a laplace or gaussian epilogue. On first run, the fully fused kernel is JIT compiled and cached for future use. This function also allows broadcasting and batching of its input tensors; see build_a_kernel.py for various usage examples. It also allows batching and broadcasting of its hyperparameters such as p, bandwidth, and regularization; see broadcast_kernel.py for examples of batching and broadcasting hyperparameters.
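
For reference, here is the unfused PyTorch equivalent of one such configuration: inner-norm abs(x), outer-norm x (the 1/p root skipped), and a Laplace epilogue. run_kernel would produce this in a single fused pass; the actual descriptor API is shown in build_a_kernel.py, and bandwidth here is a hypothetical value.

import torch

a = torch.randn(64, 1000, device='cuda')                     # a[K, M]
b = torch.randn(64, 2000, device='cuda')                     # b[K, N]
bandwidth = 10.0                                             # hypothetical hyperparameter

dist = (b.unsqueeze(2) - a.unsqueeze(1)).abs().sum(dim=0)    # [N, M] L1 distances
kernel = torch.exp(-dist / bandwidth)                        # Laplace epilogue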

Function: cdist_grad

Computes the gradient of cdist, in the form:

$out_{o,n,m} = \sum_{k=1}^{K} c_{o,k}a_{k,m}\mathrm{sgn}\left(d_{n,m}-b_{n,k}\right)\left|d_{n,m}-b_{n,k}\right|^{p-1}$

Or, expressed in C-style pseudocode, it efficiently computes:

// a[K,M], b[N,K], c[O,K], d[N,M], out[O,N,M]; out is assumed zero-initialized
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            for (int k = 0; k < K; k++) {
                float diff = d[n,m] - b[n,k];
                out[o,n,m] += c[o,k] * a[k,m] * signum(diff) * pow(abs(diff), p - 1.0);
            }
        }
    }
}

Aside from the out tensor in the out=None case, it does not allocate. It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.

It is intended to be used as follows:

  • $a_{k,m}$ is grad_kernel_matrix
  • $b_{n,k}$ is data_x
  • $c_{o,k}$ is coefficients
  • $d_{n,m}$ is data_z
  • $out_{o,n,m}$ is gradient

Tensors must satisfy

# Given tensors a,b,c,d,out and sizes M,N,O,K
# K is the contracted mode
assert a.shape == torch.Size([K,M])
assert b.shape == torch.Size([N,K])
assert c.shape == torch.Size([O,K])
assert d.shape == torch.Size([N,M])
assert out.shape == torch.Size([O,N,M])

assert a.stride(1) == 1
assert b.stride(1) == 1
assert c.stride(1) == 1
assert d.stride(1) == 1
assert out.stride(1) == 1

out = kermac.cdist_grad(a,b,c,d,out=out) # OK

Views are OK.
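
For testing, the same contraction can be written in unfused PyTorch. This reference is memory-heavy since it materializes the [N, K, M] intermediate, and the p keyword on cdist_grad is an assumption.

import torch
import kermac

K, M, N, O, p = 8, 100, 200, 4, 1.3
a = torch.randn(K, M, device='cuda')
b = torch.randn(N, K, device='cuda')
c = torch.randn(O, K, device='cuda')
d = torch.randn(N, M, device='cuda')

diff = d[:, None, :] - b[:, :, None]              # [N, K, M]
t = torch.sign(diff) * diff.abs() ** (p - 1.0)    # sgn(diff) * |diff|^(p-1)
ref = torch.einsum('ok,km,nkm->onm', c, a, t)     # [O, N, M]

out = kermac.cdist_grad(a, b, c, d, p=p)
print(torch.allclose(out, ref, atol=1e-3))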
