KERMAC
PyTorch routines for (Ker)nel (Mac)hines.
Kermac is a collection of fused CUDA kernels for fast, memory-efficient computation in kernel methods. Kermac makes heavy use of JIT (just-in-time) compilation to generate custom CUDA kernels on demand. Compiled kernels are stored in a cache database, so JIT costs are incurred only once. Passing debug=True to most kermac routines prints information about the compilation and caching of these JIT CUDA kernels.
Kermac supports only Nvidia cards with compute capability sm_80 or higher. This includes:
- Server cards like A10, A100, H100, B100
- Consumer cards like RTX 30xx, RTX 40xx, RTX 50xx
Kermac relies on cuda-core for JIT compilation, which supports CUDA toolkits 11.8 and 12.x. Thanks to the cuda-core and nvmath packages, no C++ compilation step or per-platform wheel is needed to install this library.
Installation
CUDA 12
pip install "kermac[cu12] @ git+https://github.com/Kernel-Machines/kermac"
CUDA 11
pip install "kermac[cu11] @ git+https://github.com/Kernel-Machines/kermac"
Linalg
The linalg functionality depends on nvmath-python, which is not a required dependency. To run the linalg routines, install one of:
pip install nvmath-python[cu12]
or
pip install nvmath-python[cu11]
Examples
From a fresh environment you can do:
cdist.py
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist.py
python cdist.py -d -p 1.0
cdist_grad.py
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/cdist_grad.py
python cdist_grad.py -d
build_a_kernel.py
Running build_a_kernel.py will batch compile quite a few different kernels on first run. Expect around 20 seconds of JIT compiling.
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/build_a_kernel.py
python build_a_kernel.py -d
linalg.py
wget https://raw.githubusercontent.com/Kernel-Machines/kermac/refs/heads/master/examples/linalg.py
python linalg.py
Function: linalg.solve_cholesky
Solves a symmetric positive-definite system of equations (cf. torch.cholesky_solve). Wraps xpotrf and xpotrs from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed Cholesky factorization. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, allocating a small amount of workspace memory per stream, and synchronizes these streams against the current stream at the end of the routine.
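The math that xpotrf/xpotrs implement can be sketched in pure Python (a reference for the semantics only, not the kermac API): factor $A = LL^T$, then solve $Ax = b$ with one forward and one back substitution.

```python
def cholesky(A):
    # Lower-triangular L with A = L @ L.T (what xpotrf computes)
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (A[i][i] - s) ** 0.5
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    # Forward substitution: L y = b
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

A = [[4.0, 2.0], [2.0, 3.0]]      # symmetric positive-definite
b = [8.0, 7.0]
x = solve_chol(cholesky(A), b)    # solves A x = b -> [1.25, 1.5]
```

An in-place variant would overwrite A's lower triangle with L instead of allocating a fresh matrix, which is the memory saving the kermac routine exposes.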
Function: linalg.solve_lu
Solves a general square system of equations like torch.linalg.solve.
Wraps xgetrf and xgetrs from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed LU decomposition. Additionally, this routine can write the factorization in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, allocating a small amount of workspace memory per stream, and synchronizes these streams against the current stream at the end of the routine.
Function: linalg.eigh
Computes eigenvalues and eigenvectors of a symmetric matrix, like torch.linalg.eigh.
Wraps xsyevd from nvmath.bindings.cusolverDn. This implementation is special because it does not synchronize with the CPU on a failed eigenvalue decomposition. Additionally, this routine can write the eigenvector matrix in-place to the input matrix, which in some cases avoids a full 2x increase in memory usage. It launches a separate CUDA stream for each batch passed in, allocating a small amount of workspace memory per stream, and synchronizes these streams against the current stream at the end of the routine.
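What xsyevd computes, $A = V\,\mathrm{diag}(w)\,V^T$, can be illustrated in closed form for a 2x2 symmetric matrix (a pure-Python sketch of the semantics, not the kermac API):

```python
import math

def eigh2(a, b, c):
    """Eigen-decomposition of the symmetric 2x2 matrix [[a, b], [b, c]].
    Returns eigenvalues in ascending order and orthonormal eigenvectors
    as the columns of V, so that A = V diag(w) V^T."""
    mean = (a + c) / 2.0
    r = math.hypot((a - c) / 2.0, b)
    w = (mean - r, mean + r)
    # Rotation angle diagonalizing the matrix
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    ct, st = math.cos(theta), math.sin(theta)
    V = [[-st, ct], [ct, st]]  # column j is the eigenvector for w[j]
    return w, V

w, V = eigh2(2.0, 1.0, 2.0)   # eigenvalues 1 and 3
```

The in-place option in kermac corresponds to writing the columns of V over the input matrix rather than allocating a second one.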
Function: cdist
An implementation of torch.cdist. Supports fractional norms, batching, and broadcasting. Aside from allocating the out tensor when out=None, it does not allocate.
Computes:
$out_{n,m} = \left( \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p \right)^{\frac{1}{p}}$
If skip_epilogue is set, it instead computes:
$out_{n,m} = \sum_{k=1}^{K} |b_{k,n} - a_{k,m}|^p$
Or, expressed in C-style pseudocode, it efficiently computes:
// a[K,M], b[K,N], out[N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        out[n,m] = 0;
        for (int k = 0; k < K; k++) {
            out[n,m] += pow(abs(b[k,n] - a[k,m]), p);
        }
        if (!skip_epilogue) {
            out[n,m] = pow(out[n,m], 1.0/p);
        }
    }
}
It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.
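The loop above can be mirrored as a pure-Python reference (a sketch for checking semantics; kermac's CUDA kernels do not compute it this way):

```python
def cdist_ref(a, b, p, skip_epilogue=False):
    """a is K x M, b is K x N; returns the N x M p-norm distance matrix."""
    K, M = len(a), len(a[0])
    N = len(b[0])
    out = [[0.0] * M for _ in range(N)]
    for m in range(M):
        for n in range(N):
            s = 0.0
            for k in range(K):
                s += abs(b[k][n] - a[k][m]) ** p
            out[n][m] = s if skip_epilogue else s ** (1.0 / p)
    return out

# K=2 features, one point per set: (0,0) vs (3,4)
a = [[0.0], [0.0]]  # a[K][M]
b = [[3.0], [4.0]]  # b[K][N]
d2 = cdist_ref(a, b, p=2.0)[0][0]  # Euclidean distance: 5.0
d1 = cdist_ref(a, b, p=1.0)[0][0]  # Manhattan distance: 7.0
```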
kermac.cdist vs torch.cdist
With problem size $[M,N,K] = [30000, 30000, 1024]$:
| GPU / p-norm | Speed-up (×) | kermac.cdist (ms) | torch.cdist (ms) |
|---|---|---|---|
| GH200 · p = 1.0 | 29.1× | 82 | 2,389 |
| GH200 · p = 1.3 | 9.6× | 453 | 4,360 |
| GH200 · p = 2.0 | 5.2× | 79 | 406 |
| H100-PCIe · p = 1.0 | 27.0× | 108 | 2,907 |
| H100-PCIe · p = 1.3 | 9.4× | 592 | 5,591 |
| H100-PCIe · p = 2.0 | 3.3× | 104 | 346 |
| A100 · p = 1.0 | 15.4× | 251 | 3,878 |
| A100 · p = 1.3 | 9.4× | 873 | 8,230 |
| A100 · p = 2.0 | 0.9× | 325 | 301 |
| RTX 4090 · p = 1.0 | 52.6× | 76 | 4,021 |
| RTX 4090 · p = 1.3 | 11.8× | 350 | 4,141 |
| RTX 4090 · p = 2.0 | 3.4× | 77 | 262 |
Function: run_kernel
This is a more customizable version of kermac.cdist; kermac.cdist is written on top of it. run_kernel accepts a descriptor argument that can create fully fused kernel functions. You can specify the inner-norm type (abs(x), x*x, or pow(x,p)), the outer-norm type (x, sqrt(x), or pow(x,1/p)), and finally a laplace or gaussian epilogue. On first run, the fully fused kernel is JIT compiled and cached for future use. This function also allows broadcasting and batching of its input tensors; see build_a_kernel.py for various usage examples. It also allows batching and broadcasting of its hyperparameters, such as p, bandwidth, and regularization; see broadcast_kernel.py for examples of batching and broadcasting hyperparameters.
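The composition a descriptor expresses can be sketched in plain Python. The names and the epilogue formulas below are hypothetical illustrations of the inner-norm / outer-norm / epilogue structure, not kermac's actual descriptor API:

```python
import math

# Hypothetical stand-ins for the pieces a run_kernel descriptor selects
inner_norms = {
    'abs':    lambda x, p: abs(x),
    'square': lambda x, p: x * x,
    'pow':    lambda x, p: abs(x) ** p,
}
outer_norms = {
    'identity': lambda s, p: s,
    'sqrt':     lambda s, p: math.sqrt(s),
    'root':     lambda s, p: s ** (1.0 / p),
}
epilogues = {
    'none':     lambda d, bw: d,
    'laplace':  lambda d, bw: math.exp(-d / bw),
    'gaussian': lambda d, bw: math.exp(-d * d / (2.0 * bw * bw)),
}

def fused_kernel(x, z, inner, outer, epilogue, p=2.0, bandwidth=1.0):
    # One (x, z) pair; the CUDA version fuses this over whole matrices
    s = sum(inner_norms[inner](xi - zi, p) for xi, zi in zip(x, z))
    return epilogues[epilogue](outer_norms[outer](s, p), bandwidth)

# A Gaussian kernel value: squared inner norm, sqrt outer norm, gaussian epilogue
v = fused_kernel([0.0, 0.0], [3.0, 4.0], 'square', 'sqrt', 'gaussian',
                 bandwidth=5.0)   # exp(-0.5)
```

Fusing these three stages into one JIT-compiled CUDA kernel is what avoids materializing the intermediate distance matrix.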
Function: cdist_grad
Computes the gradient of cdist, in the style of:
$out_{o,n,m} = \sum_{k=1}^{K} c_{o,k}a_{k,m}\mathrm{sgn}\left(d_{n,m}-b_{n,k}\right)\left|d_{n,m}-b_{n,k}\right|^{p-1}$
Or, expressed in C-style pseudocode, it efficiently computes:
// a[K,M], b[N,K], c[O,K], d[N,M], out[O,N,M]
for (int m = 0; m < M; m++) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < O; o++) {
            out[o,n,m] = 0;
            for (int k = 0; k < K; k++) {
                float diff = d[n,m] - b[n,k];
                out[o,n,m] += c[o,k] * a[k,m] * signum(diff) * pow(abs(diff), p - 1.0);
            }
        }
    }
}
Aside from allocating the out tensor when out=None, it does not allocate.
It has special code paths for $p=1.0$ and $p=2.0$ to avoid fractional power instructions.
It is meant to be used like:
- $a_{k,m}$ is kernel_matrix
- $b_{n,k}$ is data_x
- $c_{o,k}$ is coefficients
- $d_{n,m}$ is data_z
- $out_{o,n,m}$ is gradient
Tensors must satisfy:
# Given tensors a,b,c,d,out and sizes M,N,O,K
# K is the contracted mode
assert a.shape == torch.Size([K,M])
assert b.shape == torch.Size([N,K])
assert c.shape == torch.Size([O,K])
assert d.shape == torch.Size([N,M])
assert out.shape == torch.Size([O,N,M])
assert a.stride(1) == 1
assert b.stride(1) == 1
assert c.stride(1) == 1
assert d.stride(1) == 1
assert out.stride(1) == 1
out = kermac.cdist_grad(a,b,c,d,out=out) # OK
Views are OK
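The contraction can be mirrored as a pure-Python reference (a sketch for checking semantics; kermac's CUDA kernels do not compute it this way):

```python
def cdist_grad_ref(a, b, c, d, p):
    """a: K x M, b: N x K, c: O x K, d: N x M -> out: O x N x M."""
    K, M = len(a), len(a[0])
    N, O = len(b), len(c)
    sgn = lambda v: (v > 0) - (v < 0)
    out = [[[0.0] * M for _ in range(N)] for _ in range(O)]
    for m in range(M):
        for n in range(N):
            for o in range(O):
                for k in range(K):
                    diff = d[n][m] - b[n][k]
                    out[o][n][m] += (c[o][k] * a[k][m] * sgn(diff)
                                     * abs(diff) ** (p - 1.0))
    return out

# K = M = N = O = 1: diff = 2 - 5 = -3, p = 2 gives 1 * 1 * (-1) * 3 = -3
g = cdist_grad_ref([[1.0]], [[5.0]], [[1.0]], [[2.0]], p=2.0)[0][0][0]
```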