Skip to main content

Pytorch routines for (Ker)nel (Mac)hines

Project description

KERMAC

Kermac is a collection of fused CUDA kernels meant for fast and memory efficient computation for kernel methods. Kermac makes heavy use of JIT (Just-in-time) compilation to generate custom CUDA kernels on demand. These compiled kernels are stored in a cache database so the JIT costs are only incurred once. Using debug=True in most kermac routines will print information related to the compilation and caching of these JIT CUDA kernels.

Kermac supports only Nvidia cards with capability of sm_80 or higher. This includes:

  • Server cards like A10, A100, H100, B100
  • Consumer cards like RTX 30xx, RTX 40xx, RTX 50xx

Kermac relies on cuda-core for JIT compilation which is supported for cuda toolkits 11.8 and 12.x. Because of cuda-core and nvmath packages no C++ compilation or wheel system is needed to install this library.

Installation

CUDA 12

pip install kermac[cu12]

CUDA 11

pip install kermac[cu11]

Linalg

linalg functionality depends on nvmath-python. This isn't a required dependency. To run linalg routines please do:

pip install nvmath-python[cu12]

or

pip install nvmath-python[cu11]

Examples

From a fresh environment you can do:

cdist.py

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist.py
python cdist.py -d -p 1.0

cdist_grad.py

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist_grad.py
python cdist_grad.py -d

build_a_kernel.py

Running build_a_kernel.py will batch compile quite a few different kernels on first run. Expect around 20 seconds of JIT compiling.

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/build_a_kernel.py
python build_a_kernel.py -d

linalg.py

wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/linalg.py
python linalg.py

Function: linalg.solve_cholesky

Solves a symmetric system of equations like torch.linalg.cholesky. Wraps xpotrf and xpotrs from nvmath.bindings.cusolverDn. This implementation is special because it doesn't synchronize with the cpu on a failed cholesky factor. Additionally this routine can write the factorization in-place to the input matrix. In some cases this avoids a full 2x increase in memory usage. It launches a separate cuda-stream for each of the batches passed in. It does require a bit of workspace memory allocation for each stream. It synchronizes the cuda-streams against the current stream at the end of the routine.

Function: linalg.solve_lu

Solves a symmetric system of equations like torch.linalg.solve. Wraps xgetrf and xgetrs from nvmath.bindings.cusolverDn. This implementation is special because it doesn't synchronize with the cpu on a failed LU decomposition. Additionally this routine can write the factorization in-place to the input matrix. In some cases this avoids a full 2x increase in memory usage. It launches a separate cuda-stream for each of the batches passed in. It does require a bit of workspace memory allocation for each stream. It synchronizes the cuda-streams against the current stream at the end of the routine.

Function: linalg.eigh

Computes eigenvalues and eigenvectors of a symmetric matrix like torch.linalg.eigh Wraps xsyevd from nvmath.bindings.cusolverDn. This implementation is special because it doesn't synchronize with the cpu on a failed eigenvalue decomposition. Additionally this routine can write the eigenvector decomposition in-place to the input matrix. In some cases this avoids a full 2x increase in memory usage. It launches a separate cuda-stream for each of the batches passed in. It does require a bit of workspace memory allocation for each stream. It synchronizes the cuda-streams against the current stream at the end of the routine.

Function: cdist

An implementation of torch.cdist. Computes fractional norms. Supports batches and broadcasting. Aside from the out tensor in the out=None case does not allocate.

Computes:

$out_{n,m} = \left( \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p \right)^{\frac{1}{p}}$

If instead skip_epilogue is set it computes:

$out_{n,m} = \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p$

Or expressed in c-style it efficiently computes:

// a[K,M], b[K,N], out[N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int k = 0; k < K; k++) {
            out[n,m] += pow(abs(b[k,n] - a[k,m]), p);
        }
        if (!skip_epilogue) {
            out[n,m] = pow(out[n,m], 1.0/p);
        }
    }
}

It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.

kermac.cdist vs torch.cdist

with problem size $[M,N,K]$ = $[30000,30000,1024]$

GPU / p-norm Speed-up (×) kermac.cdist (ms) torch.cdist (ms)
GH200 · p = 1.0 29.1× 82 2,389
GH200 · p = 1.3 9.6× 453 4,360
GH200 · p = 2.0 5.2× 79 406
H100-PCIe · p = 1.0 27.0× 108 2,907
H100-PCIe · p = 1.3 9.4× 592 5,591
H100-PCIe · p = 2.0 3.3× 104 346
A100 · p = 1.0 15.4× 251 3,878
A100 · p = 1.3 9.4× 873 8,230
A100 · p = 2.0 0.9× 325 301
RTX 4090 · p = 1.0 52.6× 76 4,021
RTX 4090 · p = 1.3 11.8× 350 4,141
RTX 4090 · p = 2.0 3.4× 77 262

Function: run_kernel

This is a more customizable version of kermac.cdist, kermac.cdist is written on top of this. run_kernel allows a descriptor as one of it's arguments that can create fully fused kernel functions. You can specify the inner-norm type (abs(x), x*x, or pow(x,p)), the outer-norm type (x, sqrt(x), or pow(x,1/p)) and finally a laplace or gaussian epilogue. On first run a fully fused kernel will be JIT compiled and cached for future use. This function also allows broadcasting and batching of it's input tensors. See build_a_kernel.py for various examples of usage. It also allows batching and broadcasting of it's hyperparameters such as p, bandwidth, and regularization. See broadcast_kernel.py for examples of batching and broadcasting hyperparameters

Function: cdist_grad

Computes the gradient of cdist in the style like:

$out_{o,n,m} = \sum_{k=1}^{K} c_{o,k}a_{k,m}\mathrm{sgn}\left(d_{n,m}-b_{n,k}\right)\left|d_{n,m}-b_{n,k}\right|^{p-1}$

Or expressed in c-style it efficiently computes:

// a[K,M], b[N,K], c[O,K], d[N,M], out[O,N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            for (int k = 0; k < K; k++) {
                float diff = d[n,m] - b[n,k];
                out[o,n,m] += c[o,k] * a[k,m] * signum(diff) * pow(abs(diff), p - 1.0));
            }
        }
    }
}

Aside from the out tensor in the out=None case DOES NOT ALLOCATE It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.

It's supposed to be used like:

  • $a_{k,m}$ is kernel_matrix
  • $b_{n,k}$ is data_x
  • $c_{o,k}$ is coefficients
  • $d_{n,m}$ is data_z
  • $out_{o,n,k}$ is gradient

Tensors must satisfy

# Given tensors a,b,c,d,out and sizes M,N,O,K
# K is the contracted mode
assert a.shape == torch.Size([K,M])
assert b.shape == torch.Size([N,K])
assert c.shape == torch.Size([O,K])
assert d.shape == torch.Size([N,M])
assert out.shape == torch.Size([O,N,M])

assert a.stride(1) == 1
assert b.stride(1) == 1
assert c.stride(1) == 1
assert d.stride(1) == 1
assert out.stride(1) == 1

out = kermac.cdist_grad(a,b,c,d,out=out) # OK

Views are OK

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kermac-0.1.2.tar.gz (838.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kermac-0.1.2-py3-none-any.whl (855.6 kB view details)

Uploaded Python 3

File details

Details for the file kermac-0.1.2.tar.gz.

File metadata

  • Download URL: kermac-0.1.2.tar.gz
  • Upload date:
  • Size: 838.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kermac-0.1.2.tar.gz
Algorithm Hash digest
SHA256 f8a71a2fc0756758d2dfe0774e754004bc94280e2f31e25425fa552457a73f4f
MD5 c4a79a58d466b860a636af2ad43ba413
BLAKE2b-256 e9071cfd9daf343b3d348bd83179b6e25aaaf633f25a92c8e23ed8934c1b2b7c

See more details on using hashes here.

File details

Details for the file kermac-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: kermac-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 855.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kermac-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 7e5463ae8efaf56b44337d5a60ece17e539944d13f3fa123219a1208e22d75ef
MD5 5b97a52b3959e1d6145e29ee8d7443fc
BLAKE2b-256 75ca54b4b3a9a410a3c78b9258defad0cf16cc74d4be68f587fba6db8ca6912a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page