KERMAC
PyTorch routines for (Ker)nel (Mac)hines.
Kermac is a collection of fused CUDA kernels for fast, memory-efficient computation in kernel methods. Kermac makes heavy use of JIT (just-in-time) compilation to generate custom CUDA kernels on demand. Compiled kernels are stored in a cache database, so JIT costs are incurred only once. Passing debug=True to most kermac routines prints information about the compilation and caching of these JIT CUDA kernels.
Kermac supports only Nvidia cards with compute capability sm_80 or higher. This includes:
- Server cards like A10, A100, H100, B100
- Consumer cards like RTX 30xx, RTX 40xx, RTX 50xx
Kermac relies on cuda-core for JIT compilation, which supports CUDA toolkits 11.8 and 12.x. Thanks to the cuda-core and nvmath packages, no C++ compilation step or per-platform wheel is needed to install this library.
Installation
CUDA 12
pip install "kermac[cu12] @ git+https://github.com/Kernel-Machines/kermac"
CUDA 11
pip install "kermac[cu11] @ git+https://github.com/Kernel-Machines/kermac"
Linalg
The linalg functionality depends on nvmath-python, which is not a required dependency. To run the linalg routines, install one of:
pip install nvmath-python[cu12]
or
pip install nvmath-python[cu11]
Examples
From a fresh environment you can do:
cdist.py
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist.py
python cdist.py -d -p 1.0
cdist_grad.py
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist_grad.py
python cdist_grad.py -d
build_a_kernel.py
Running build_a_kernel.py will batch compile quite a few different kernels on first run. Expect around 20 seconds of JIT compiling.
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/build_a_kernel.py
python build_a_kernel.py -d
linalg.py
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/linalg.py
python linalg.py
Function: linalg.solve_cholesky
Solves a symmetric positive-definite system of equations (cf. torch.cholesky_solve). Wraps xpotrf and xpotrs from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed Cholesky factorization. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, allocating a small amount of workspace memory per stream, and synchronizes these streams against the current stream at the end of the routine.
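The math that xpotrf/xpotrs implement can be sketched in pure Python (a reference for the semantics only, not the kermac API): factor $A = LL^T$, then solve $Ax = b$ with one forward and one back substitution.

```python
def cholesky(A):
    # Lower-triangular L with A = L @ L.T (what xpotrf computes)
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (A[i][i] - s) ** 0.5
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    # Forward substitution: L y = b
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

A = [[4.0, 2.0], [2.0, 3.0]]      # symmetric positive-definite
b = [8.0, 7.0]
x = solve_chol(cholesky(A), b)    # solves A x = b -> [1.25, 1.5]
```

An in-place variant would overwrite A's lower triangle with L instead of allocating a fresh matrix, which is the memory saving the kermac routine exposes.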
Function: linalg.solve_lu
Solves a general square system of equations like torch.linalg.solve.
Wraps xgetrf and xgetrs from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed LU decomposition. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, allocating a small amount of workspace memory per stream, and synchronizes these streams against the current stream at the end of the routine.
Function: linalg.eigh
Computes eigenvalues and eigenvectors of a symmetric matrix, like torch.linalg.eigh.
Wraps xsyevd from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed eigenvalue decomposition. Additionally, this routine can write the eigenvector matrix in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, allocating a small amount of workspace memory per stream, and synchronizes these streams against the current stream at the end of the routine.
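What xsyevd computes, $A = V\,\mathrm{diag}(w)\,V^T$, can be illustrated in closed form for a 2x2 symmetric matrix (a pure-Python sketch of the semantics, not the kermac API):

```python
import math

def eigh2(a, b, c):
    """Eigen-decomposition of the symmetric 2x2 matrix [[a, b], [b, c]].
    Returns eigenvalues in ascending order and orthonormal eigenvectors
    as the columns of V, so that A = V diag(w) V^T."""
    mean = (a + c) / 2.0
    r = math.hypot((a - c) / 2.0, b)
    w = (mean - r, mean + r)
    # Rotation angle diagonalizing the matrix
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    ct, st = math.cos(theta), math.sin(theta)
    V = [[-st, ct], [ct, st]]  # column j is the eigenvector for w[j]
    return w, V

w, V = eigh2(2.0, 1.0, 2.0)   # eigenvalues 1 and 3
```

The in-place option in kermac corresponds to writing the columns of V over the input matrix rather than allocating a second one.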
Function: cdist
An implementation of torch.cdist. Supports fractional norms, batching, and broadcasting. Aside from allocating the out tensor when out=None, it does not allocate.
Computes:
$out_{n,m} = \left( \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p \right)^{\frac{1}{p}}$
If skip_epilogue is set, it instead computes:
$out_{n,m} = \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p$
Or, expressed in C-style pseudocode, it efficiently computes:
// a[K,M], b[K,N], out[N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        out[n,m] = 0;
        for (int k = 0; k < K; k++) {
            out[n,m] += pow(abs(b[k,n] - a[k,m]), p);
        }
        if (!skip_epilogue) {
            out[n,m] = pow(out[n,m], 1.0/p);
        }
    }
}
It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.
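The loop above can be mirrored as a pure-Python reference (a sketch for checking semantics; kermac's CUDA kernels do not compute it this way):

```python
def cdist_ref(a, b, p, skip_epilogue=False):
    """a is K x M, b is K x N; returns the N x M p-norm distance matrix."""
    K, M = len(a), len(a[0])
    N = len(b[0])
    out = [[0.0] * M for _ in range(N)]
    for m in range(M):
        for n in range(N):
            s = 0.0
            for k in range(K):
                s += abs(b[k][n] - a[k][m]) ** p
            out[n][m] = s if skip_epilogue else s ** (1.0 / p)
    return out

# K=2 features, one point per set: (0,0) vs (3,4)
a = [[0.0], [0.0]]  # a[K][M]
b = [[3.0], [4.0]]  # b[K][N]
d2 = cdist_ref(a, b, p=2.0)[0][0]  # Euclidean distance: 5.0
d1 = cdist_ref(a, b, p=1.0)[0][0]  # Manhattan distance: 7.0
```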
kermac.cdist vs torch.cdist
With problem size $[M,N,K] = [30000, 30000, 1024]$:
| GPU / p-norm | Speed-up (×) | kermac.cdist (ms) | torch.cdist (ms) |
|---|---|---|---|
| GH200 · p = 1.0 | 29.1× | 82 | 2,389 |
| GH200 · p = 1.3 | 9.6× | 453 | 4,360 |
| GH200 · p = 2.0 | 5.2× | 79 | 406 |
| H100-PCIe · p = 1.0 | 27.0× | 108 | 2,907 |
| H100-PCIe · p = 1.3 | 9.4× | 592 | 5,591 |
| H100-PCIe · p = 2.0 | 3.3× | 104 | 346 |
| A100 · p = 1.0 | 15.4× | 251 | 3,878 |
| A100 · p = 1.3 | 9.4× | 873 | 8,230 |
| A100 · p = 2.0 | 0.9× | 325 | 301 |
| RTX 4090 · p = 1.0 | 52.6× | 76 | 4,021 |
| RTX 4090 · p = 1.3 | 11.8× | 350 | 4,141 |
| RTX 4090 · p = 2.0 | 3.4× | 77 | 262 |
Function: run_kernel
This is a more customizable version of kermac.cdist; kermac.cdist is written on top of it. run_kernel accepts a descriptor argument that can create fully fused kernel functions. You can specify the inner-norm type (abs(x), x*x, or pow(x,p)), the outer-norm type (x, sqrt(x), or pow(x,1/p)), and finally a laplace or gaussian epilogue. On first run, the fully fused kernel is JIT compiled and cached for future use. This function also allows broadcasting and batching of its input tensors; see build_a_kernel.py for various usage examples. It also allows batching and broadcasting of its hyperparameters, such as p, bandwidth, and regularization; see broadcast_kernel.py for examples of batching and broadcasting hyperparameters.
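The composition a descriptor expresses can be sketched in plain Python. The names and the epilogue formulas below are hypothetical illustrations of the inner-norm / outer-norm / epilogue structure, not kermac's actual descriptor API:

```python
import math

# Hypothetical stand-ins for the pieces a run_kernel descriptor selects
inner_norms = {
    'abs':    lambda x, p: abs(x),
    'square': lambda x, p: x * x,
    'pow':    lambda x, p: abs(x) ** p,
}
outer_norms = {
    'identity': lambda s, p: s,
    'sqrt':     lambda s, p: math.sqrt(s),
    'root':     lambda s, p: s ** (1.0 / p),
}
epilogues = {
    'none':     lambda d, bw: d,
    'laplace':  lambda d, bw: math.exp(-d / bw),
    'gaussian': lambda d, bw: math.exp(-d * d / (2.0 * bw * bw)),
}

def fused_kernel(x, z, inner, outer, epilogue, p=2.0, bandwidth=1.0):
    # One (x, z) pair; the CUDA version fuses this over whole matrices
    s = sum(inner_norms[inner](xi - zi, p) for xi, zi in zip(x, z))
    return epilogues[epilogue](outer_norms[outer](s, p), bandwidth)

# A Gaussian kernel value: squared inner norm, sqrt outer norm, gaussian epilogue
v = fused_kernel([0.0, 0.0], [3.0, 4.0], 'square', 'sqrt', 'gaussian',
                 bandwidth=5.0)   # exp(-0.5)
```

Fusing these three stages into one JIT-compiled CUDA kernel is what avoids materializing the intermediate distance matrix.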
Function: cdist_grad
Computes the gradient of cdist, in the style of:
$out_{o,n,m} = \sum_{k=1}^{K} c_{o,k}a_{k,m}\mathrm{sgn}\left(d_{n,m}-b_{n,k}\right)\left|d_{n,m}-b_{n,k}\right|^{p-1}$
Or, expressed in C-style pseudocode, it efficiently computes:
// a[K,M], b[N,K], c[O,K], d[N,M], out[O,N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            out[o,n,m] = 0;
            for (int k = 0; k < K; k++) {
                float diff = d[n,m] - b[n,k];
                out[o,n,m] += c[o,k] * a[k,m] * signum(diff) * pow(abs(diff), p - 1.0);
            }
        }
    }
}
Aside from allocating the out tensor when out=None, it does not allocate.
It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.
It is meant to be used like:
- $a_{k,m}$ is kernel_matrix
- $b_{n,k}$ is data_x
- $c_{o,k}$ is coefficients
- $d_{n,m}$ is data_z
- $out_{o,n,m}$ is gradient
Tensors must satisfy:
# Given tensors a,b,c,d,out and sizes M,N,O,K
# K is the contracted mode
assert a.shape == torch.Size([K,M])
assert b.shape == torch.Size([N,K])
assert c.shape == torch.Size([O,K])
assert d.shape == torch.Size([N,M])
assert out.shape == torch.Size([O,N,M])
assert a.stride(1) == 1
assert b.stride(1) == 1
assert c.stride(1) == 1
assert d.stride(1) == 1
assert out.stride(1) == 1
out = kermac.cdist_grad(a,b,c,d,out=out) # OK
Views are OK
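The contraction can be mirrored as a pure-Python reference (a sketch for checking semantics; kermac's CUDA kernels do not compute it this way):

```python
def cdist_grad_ref(a, b, c, d, p):
    """a: K x M, b: N x K, c: O x K, d: N x M -> out: O x N x M."""
    K, M = len(a), len(a[0])
    N, O = len(b), len(c)
    sgn = lambda v: (v > 0) - (v < 0)
    out = [[[0.0] * M for _ in range(N)] for _ in range(O)]
    for m in range(M):
        for n in range(N):
            for o in range(O):
                for k in range(K):
                    diff = d[n][m] - b[n][k]
                    out[o][n][m] += (c[o][k] * a[k][m] * sgn(diff)
                                     * abs(diff) ** (p - 1.0))
    return out

# K = M = N = O = 1: diff = 2 - 5 = -3, p = 2 gives 1 * 1 * (-1) * 3 = -3
g = cdist_grad_ref([[1.0]], [[5.0]], [[1.0]], [[2.0]], p=2.0)[0][0][0]
```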