
levi-edge

Drop-in small-matrix acceleration for PyTorch on edge devices.

One import, one patch(), and your small matrix multiplications run up to 2-4x faster. No code changes needed.

import torch
import levi_edge

levi_edge.patch()

# That's it. All torch.mm / torch.bmm calls with matrices <= 256x256
# now use optimized CUDA kernels instead of cuBLAS.
C = torch.mm(A, B)  # 2-4x faster for small matrices, same result

Why

cuBLAS is the gold standard for large matrix operations. But for matrices up to 256x256 — common in edge inference, attention heads, small MLPs — cuBLAS wastes time on dispatch overhead that exceeds actual computation time.

LEVI Edge replaces torch.mm and torch.bmm with hand-tuned CUDA kernels optimized for this size range. Larger matrices automatically fall through to cuBLAS unchanged.

Verified performance (NVIDIA RTX 3060, C++ path via load_inline):

Matrix Size   Speedup vs cuBLAS
64x64         ~1.4x
128x128       ~1.3x
192x192       ~1.4x
256x256+      cuBLAS (automatic fallback)

On edge GPUs (Jetson Nano/Orin) where cuBLAS dispatch overhead is proportionally larger, speedups of 2-4x are expected at these sizes.
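
These numbers are straightforward to reproduce. The sketch below is a minimal CUDA-event timing loop (it assumes a CUDA device and square float32 matrices); absolute timings vary with GPU, driver, and clock state.

import torch
import levi_edge

def time_mm(n, iters=1000):
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.mm(a, b)  # warm-up (also triggers any one-time compilation)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean milliseconds per call

baseline = time_mm(128)   # cuBLAS
levi_edge.patch()
patched = time_mm(128)    # LEVI kernel
print(f"speedup: {baseline / patched:.2f}x")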

Installation

pip install levi-edge

Requires: PyTorch >= 2.1, CUDA GPU, CUDA toolkit (for kernel compilation).

Usage

Global Patch (recommended)

import torch
import levi_edge

levi_edge.patch()

# Everything works as before — just faster for small matrices
A = torch.randn(64, 128, device="cuda")
B = torch.randn(128, 64, device="cuda")
C = torch.mm(A, B)  # Uses LEVI kernel

# Large matrices still use cuBLAS
C_big = torch.mm(torch.randn(1024, 1024, device="cuda"),
                 torch.randn(1024, 1024, device="cuda"))

levi_edge.unpatch()  # Restore original behavior

Direct Call

import torch
import levi_edge

A = torch.randn(64, 128, device="cuda")
B = torch.randn(128, 64, device="cuda")

C = levi_edge.mm(A, B)                             # Always uses LEVI for eligible tensors
C = levi_edge.bmm(A.unsqueeze(0), B.unsqueeze(0))  # Batched version (expects 3D tensors)

Benchmark

from levi_edge.benchmark import benchmark_mm

results = benchmark_mm()  # Tests all edge-relevant sizes

Or from command line:

python -m levi_edge.benchmark

How It Works

LEVI Edge uses PyTorch's torch.library API to intercept aten::mm and aten::bmm at the CUDA dispatch level. When a matrix multiplication is called:

  1. Check dimensions: If M, N, K are all <= 256 and dtype is float32 → use LEVI kernel
  2. Select kernel: Two specialized CUDA kernels:
    • Simple kernel (M*N <= 16384): Cache-friendly with 4x loop unrolling, zero shared memory overhead
    • Tiled kernel (16384 < M*N <= 65536): 16x16 shared memory tiles for better bandwidth
  3. Fall back: For anything larger → cuBLAS handles it (zero overhead)

Autograd works transparently — no special handling needed.
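
In Python terms, the routing logic is roughly the following monkeypatch-style sketch. This is illustrative only: the real patch intercepts at the aten dispatch level as described above, and levi_mm_kernel is a hypothetical stand-in for the compiled CUDA kernel entry point.

import torch

_orig_mm = torch.mm

def _patched_mm(a, b):
    m, k = a.shape
    _, n = b.shape
    eligible = (
        a.is_cuda and b.is_cuda
        and a.dtype == torch.float32 and b.dtype == torch.float32
        and max(m, n, k) <= 256
    )
    if eligible:
        return levi_mm_kernel(a, b)  # hypothetical custom-kernel entry point
    return _orig_mm(a, b)            # everything else: cuBLAS, unchanged

torch.mm = _patched_mm  # conceptually what patch() does
# torch.mm = _orig_mm   # conceptually what unpatch() does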

When Is This Useful?

  • Edge AI inference (Jetson Nano/Orin, mobile GPUs): Small MLPs, classifiers
  • Transformer attention heads: Head dimensions typically 32-128 (see the sketch after this list)
  • Sensor fusion: Multiple small matrix operations in real-time
  • Robotics: Low-latency inference on embedded GPUs
  • Batch-1 inference: Single-sample inference where cuBLAS overhead dominates
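
To make the attention-head case concrete, here is a sketch of single-sample multi-head attention after patch(); with seq_len = 128 and head_dim = 64, every per-head matmul stays at or below 128x128 and lands in the LEVI range.

import torch
import levi_edge

levi_edge.patch()

n_heads, seq_len, head_dim = 8, 128, 64
q = torch.randn(n_heads, seq_len, head_dim, device="cuda")
k = torch.randn(n_heads, seq_len, head_dim, device="cuda")
v = torch.randn(n_heads, seq_len, head_dim, device="cuda")

scores = torch.bmm(q, k.transpose(1, 2)) / head_dim ** 0.5  # per-head 128x64 @ 64x128
attn = torch.softmax(scores, dim=-1)
out = torch.bmm(attn, v)                                    # per-head 128x128 @ 128x64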

When Is This NOT Useful?

  • Large batch training (batch sizes >> 256)
  • Large model inference (GPT-class models)
  • CPU-only deployment
  • Non-float32 operations (fp16/bf16 — future support planned)

Eligible Operations

Condition       Required
Device          CUDA
Dtype           float32
Max dimension   256 (M, N, K each)
Operations      torch.mm, torch.bmm, torch.matmul (2D)
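
For example, after patch() the first call below is served by the LEVI kernel, while the other two fall through to cuBLAS:

import torch
import levi_edge

levi_edge.patch()

a = torch.randn(200, 200, device="cuda")
c1 = torch.mm(a, a)        # eligible: CUDA, float32, all dims <= 256

h = a.half()
c2 = torch.mm(h, h)        # ineligible dtype (float16): cuBLAS

big = torch.randn(512, 512, device="cuda")
c3 = torch.mm(big, big)    # dimension > 256: cuBLAS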

Examples

See the examples/ directory in the project repository.

Theory

The kernel selection thresholds were determined using susceptibility analysis (sigma_c) — measuring execution time stability across matrix sizes to find the critical scale where cuBLAS dispatch overhead transitions from dominant to negligible. See sigmacore for the general framework.
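
A crude, self-contained stand-in for that analysis (plain PyTorch, not the sigmacore API) is to time torch.mm across sizes and track the run-to-run coefficient of variation; the size at which it collapses marks the scale where dispatch overhead stops dominating.

import torch

def runtime_cov(n, trials=50):
    a = torch.randn(n, n, device="cuda")
    torch.mm(a, a)  # warm-up
    torch.cuda.synchronize()
    times = []
    for _ in range(trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.mm(a, a)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    t = torch.tensor(times)
    return (t.std() / t.mean()).item()  # relative run-to-run variability

for n in (32, 64, 128, 256, 512):
    print(n, round(runtime_cov(n), 3))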

Based on: M.C. Wurm, "Batch-Size Susceptibility across Five Computational Domains" (2024/2025).

License

Dual-licensed (open-source and commercial). For commercial licensing, contact: nfo@forgottenforge.xyz
