levi-edge
Drop-in small-matrix acceleration for PyTorch on edge devices.
One import, one patch() call, and your small matrix multiplications run faster: about 1.3-1.4x measured on an RTX 3060, with 2-4x expected on edge GPUs. No code changes needed.
import torch
import levi_edge
levi_edge.patch()
# That's it. torch.mm / torch.bmm calls on float32 CUDA matrices with
# M, N, K <= 256 now use optimized CUDA kernels instead of cuBLAS.
A = torch.randn(64, 128, device="cuda")
B = torch.randn(128, 64, device="cuda")
C = torch.mm(A, B) # Faster for small matrices, same result
Why
cuBLAS is the gold standard for large matrix operations. But for matrices up to 256x256 — common in edge inference, attention heads, small MLPs — cuBLAS wastes time on dispatch overhead that exceeds actual computation time.
LEVI Edge replaces torch.mm and torch.bmm with hand-tuned CUDA kernels optimized for this size range. Larger matrices automatically fall through to cuBLAS unchanged.
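To make the overhead argument concrete, here is a rough back-of-the-envelope sketch. The peak throughput and per-call overhead below are illustrative assumptions, not measurements of any particular GPU or of cuBLAS. The point is only that a fixed per-call cost dominates when the multiplication itself is tiny.

```python
# Illustrative model: time = fixed per-call overhead + FLOPs / throughput.
# Both constants are assumptions chosen for illustration, not measurements.
ASSUMED_PEAK_FLOPS = 10e12   # hypothetical sustained throughput, FLOP/s
ASSUMED_OVERHEAD_S = 5e-6    # hypothetical fixed dispatch/launch cost, seconds


def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def overhead_fraction(m, n, k,
                      peak_flops=ASSUMED_PEAK_FLOPS,
                      overhead_s=ASSUMED_OVERHEAD_S):
    """Share of total call time spent on the fixed overhead under this model."""
    compute_s = matmul_flops(m, n, k) / peak_flops
    return overhead_s / (overhead_s + compute_s)


if __name__ == "__main__":
    for size in (64, 128, 256, 1024):
        frac = overhead_fraction(size, size, size)
        print(f"{size}x{size}: overhead share under this model = {frac:.1%}")
```

Under these assumed constants, the fixed cost accounts for nearly all of the time at 64x64 and only a small share at 1024x1024, which is the regime split this library targets.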
Verified performance (NVIDIA RTX 3060, C++ path via load_inline):
| Matrix Size | Speedup vs cuBLAS |
|---|---|
| 64x64 | ~1.4x |
| 128x128 | ~1.3x |
| 192x192 | ~1.4x |
| 256x256+ | cuBLAS (automatic fallback) |
On edge GPUs (Jetson Nano/Orin) where cuBLAS dispatch overhead is proportionally larger, speedups of 2-4x are expected at these sizes.
Installation
pip install levi-edge
Requires: PyTorch >= 2.1, CUDA GPU, CUDA toolkit (for kernel compilation).
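Before installing, a quick check of the prerequisites can save a failed kernel build. The sketch below uses only the standard library; `check_prerequisites` is a hypothetical helper, not part of levi-edge, and finding `nvcc` on the PATH is only a heuristic for "CUDA toolkit installed".

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether PyTorch is importable and nvcc is discoverable.

    Heuristic only: this does not verify the PyTorch version (>= 2.1)
    or that a CUDA-capable GPU is actually present.
    """
    return {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If PyTorch is installed, `torch.__version__` and `torch.cuda.is_available()` cover the version and GPU checks that this sketch leaves out.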
Usage
Global Patch (recommended)
import torch
import levi_edge
levi_edge.patch()
# Everything works as before — just faster for small matrices
A = torch.randn(64, 128, device="cuda")
B = torch.randn(128, 64, device="cuda")
C = torch.mm(A, B) # Uses LEVI kernel
# Large matrices still use cuBLAS
C_big = torch.mm(torch.randn(1024, 1024, device="cuda"),
                 torch.randn(1024, 1024, device="cuda"))
levi_edge.unpatch() # Restore original behavior
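Because patch() changes global behavior, it can be convenient to scope it so that unpatch() always runs, even if the code in between raises. The following is a minimal sketch: `temporarily_patched` is a hypothetical helper, not part of levi-edge, and it assumes only the `patch()` and `unpatch()` functions shown above.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_patched(backend):
    """Call backend.patch() on entry and backend.unpatch() on exit.

    `backend` is any object exposing patch() and unpatch(), for example
    the levi_edge module itself.
    """
    backend.patch()
    try:
        yield backend
    finally:
        # Runs even if the body raises, so the original behavior is restored.
        backend.unpatch()


# Usage (requires levi_edge and a CUDA device):
#   import torch, levi_edge
#   with temporarily_patched(levi_edge):
#       C = torch.mm(A, B)   # patched inside the block
#   # original torch.mm behavior is restored here
```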
Direct Call
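import torch
import levi_edge
A = torch.randn(64, 128, device="cuda")
B = torch.randn(128, 64, device="cuda")
C = levi_edge.mm(A, B) # Always uses LEVI for eligible tensors
C = levi_edge.bmm(A.unsqueeze(0), B.unsqueeze(0)) # Batched version (3D inputs)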
Benchmark
from levi_edge.benchmark import benchmark_mm
results = benchmark_mm() # Tests all edge-relevant sizes
Or from command line:
python -m levi_edge.benchmark
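To time your own workload rather than the built-in size sweep, a generic harness along these lines may help. This is a sketch, not how `benchmark_mm` is implemented. `time_call` and `speedup` are hypothetical helpers, and for GPU code you would pass `sync=torch.cuda.synchronize` so that asynchronous kernel launches are included in the measurement.

```python
import statistics
import time


def time_call(fn, warmup=10, iters=100, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    warmup: untimed calls to absorb one-time costs (compilation, caches).
    sync:   optional callable run after each call to wait for async work,
            e.g. torch.cuda.synchronize for CUDA kernels.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """Ratio > 1 means the candidate is faster than the baseline."""
    return baseline_s / candidate_s


# Usage (requires torch, levi_edge, and a CUDA device):
#   base = time_call(lambda: torch.mm(A, B), sync=torch.cuda.synchronize)
#   levi_edge.patch()
#   fast = time_call(lambda: torch.mm(A, B), sync=torch.cuda.synchronize)
#   print(speedup(base, fast))
```

The median is used instead of the mean because small-kernel timings tend to have occasional large outliers.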
How It Works
LEVI Edge uses PyTorch's torch.library API to intercept aten::mm and aten::bmm at the CUDA dispatch level. When a matrix multiplication is called:
- Check dimensions: If M, N, K are all <= 256 and dtype is float32 → use LEVI kernel
- Select kernel: Two specialized CUDA kernels:
  - Simple kernel (M*N <= 16384): Cache-friendly with 4x loop unrolling, zero shared memory overhead
  - Tiled kernel (16384 < M*N <= 65536): 16x16 shared memory tiles for better bandwidth
- Fall back: For anything larger → cuBLAS handles it (zero overhead)
Autograd works transparently — no special handling needed.
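The selection steps above can be sketched in plain Python. This mirrors only the thresholds as stated in this section; the function name, the string labels, and the way dtype and device are passed are assumptions for illustration, and the real dispatch happens inside the library's registered CUDA implementation.

```python
MAX_DIM = 256             # per-dimension limit for M, N, K
SIMPLE_MAX_ELEMS = 16384  # M*N limit for the simple kernel
TILED_MAX_ELEMS = 65536   # M*N limit for the tiled kernel


def select_kernel(m, n, k, dtype="float32", device="cuda"):
    """Return which path handles an (m x k) @ (k x n) product.

    Returns "simple", "tiled", or "fallback" (the original PyTorch path,
    which is cuBLAS on CUDA devices).
    """
    if device != "cuda" or dtype != "float32":
        return "fallback"
    if max(m, n, k) > MAX_DIM:
        return "fallback"
    if m * n <= SIMPLE_MAX_ELEMS:
        return "simple"
    if m * n <= TILED_MAX_ELEMS:
        return "tiled"
    return "fallback"  # unreachable when m, n <= 256; kept for clarity


if __name__ == "__main__":
    for shape in [(64, 64, 64), (192, 192, 192), (1024, 1024, 1024)]:
        print(shape, "->", select_kernel(*shape))
```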
When Is This Useful?
- Edge AI inference (Jetson Nano/Orin, mobile GPUs): Small MLPs, classifiers
- Transformer attention heads: Head dimensions typically 32-128
- Sensor fusion: Multiple small matrix operations in real-time
- Robotics: Low-latency inference on embedded GPUs
- Batch-1 inference: Single-sample inference where cuBLAS overhead dominates
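As a worked example for the attention case, the sketch below derives the per-head matrix shapes from a model configuration and checks them against the 256-per-dimension limit. The configuration values are hypothetical, and the helper names are not part of levi-edge.

```python
MAX_DIM = 256  # per-dimension limit (M, N, K each) stated in this README


def attention_matmul_shapes(seq_len, d_model, num_heads):
    """Return (M, N, K) for the two per-head products in self-attention."""
    head_dim = d_model // num_heads
    return {
        # Q (seq_len x head_dim) @ K^T (head_dim x seq_len)
        "scores": (seq_len, seq_len, head_dim),
        # softmax(scores) (seq_len x seq_len) @ V (seq_len x head_dim)
        "context": (seq_len, head_dim, seq_len),
    }


def all_eligible(shapes):
    """True if every dimension of every product is within MAX_DIM."""
    return all(max(dims) <= MAX_DIM for dims in shapes.values())


if __name__ == "__main__":
    # Hypothetical small model: d_model=512, 8 heads -> head_dim=64.
    for seq_len in (128, 512):
        shapes = attention_matmul_shapes(seq_len, d_model=512, num_heads=8)
        print(seq_len, shapes, "eligible:", all_eligible(shapes))
```

Note that the sequence length matters as much as the head dimension: with a head dimension of 64, a sequence of 128 tokens stays within the limit, while 512 tokens does not.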
When Is This NOT Useful?
- Large batch training (batch sizes >> 256)
- Large model inference (GPT-class models)
- CPU-only deployment
- Non-float32 operations (fp16/bf16 — future support planned)
Eligible Operations
| Condition | Required |
|---|---|
| Device | CUDA |
| Dtype | float32 |
| Max dimension | 256 (M, N, K each) |
| Operations | torch.mm, torch.bmm, torch.matmul (2D) |
Examples
See examples/:
- basic_usage.py: Patch/unpatch, direct API
- edge_inference.py: MobileNetV2 + MLP on edge
- transformer_attention.py: Small transformer heads
- jetson_demo.py: Real-time inference loop with latency tracking
- benchmark_all.py: Full benchmark suite
Theory
The kernel selection thresholds were determined using susceptibility analysis (sigma_c) — measuring execution time stability across matrix sizes to find the critical scale where cuBLAS dispatch overhead transitions from dominant to negligible. See sigmacore for the general framework.
Based on: M.C. Wurm, "Batch-Size Susceptibility across Five Computational Domains" (2024/2025).
License
Dual-licensed under:
- AGPL-3.0 for open-source / non-commercial use (license_AGPL.txt)
- Commercial license for proprietary integration (license_COMMERCIAL.txt)
For commercial licensing, contact: nfo@forgottenforge.xyz
Related
- batch-susceptibility — Optimal batch size finder for ML training
- sigmacore — The general sigma_c framework
- levi-gpu — The original LEVI library (CuPy-based)