levi-edge

Drop-in small-matrix acceleration for PyTorch on edge devices.

One import, one patch() call, and eligible small matrix multiplications run faster: roughly 1.3-1.4x measured on a desktop RTX 3060, with 2-4x expected on edge GPUs. No other code changes needed.

import torch
import levi_edge

levi_edge.patch()

# That's it. All torch.mm / torch.bmm calls on float32 CUDA matrices
# with M, N, K <= 256 now use optimized CUDA kernels instead of cuBLAS.
A = torch.randn(64, 64, device="cuda")
B = torch.randn(64, 64, device="cuda")
C = torch.mm(A, B)  # faster for small matrices, same result

Why

cuBLAS is the gold standard for large matrix operations. But for matrices up to 256x256, which are common in edge inference, attention heads, and small MLPs, cuBLAS dispatch overhead can exceed the actual computation time.
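To make the overhead argument concrete, here is a rough back-of-the-envelope sketch. The throughput and per-call overhead figures are illustrative assumptions, not measurements from this project, and the function names are hypothetical.

```python
# Simple cost model: total time = fixed dispatch overhead + FLOPs / throughput.
# Both constants below are ASSUMED, illustrative values (not measured).
ASSUMED_TFLOPS = 10.0       # assumed sustained float32 throughput
ASSUMED_OVERHEAD_US = 5.0   # assumed fixed per-call dispatch/launch cost


def matmul_flops(m: int, n: int, k: int) -> int:
    """An (m x k) @ (k x n) product needs m*n*k multiply-adds, i.e. 2*m*n*k FLOPs."""
    return 2 * m * n * k


def compute_time_us(m: int, n: int, k: int, tflops: float = ASSUMED_TFLOPS) -> float:
    """Idealized arithmetic time in microseconds at the assumed throughput."""
    return matmul_flops(m, n, k) / (tflops * 1e12) * 1e6


def overhead_fraction(m: int, n: int, k: int,
                      overhead_us: float = ASSUMED_OVERHEAD_US,
                      tflops: float = ASSUMED_TFLOPS) -> float:
    """Share of the total call time spent on the fixed overhead."""
    compute = compute_time_us(m, n, k, tflops)
    return overhead_us / (overhead_us + compute)


if __name__ == "__main__":
    for size in (64, 256, 4096):
        share = overhead_fraction(size, size, size)
        print(f"{size}x{size}: overhead share = {share:.3f}")
```

Under these assumed numbers, the fixed overhead accounts for nearly all of the call time at 64x64 and a negligible share at 4096x4096, which is the regime split the paragraph above describes.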

LEVI Edge replaces torch.mm and torch.bmm with hand-tuned CUDA kernels optimized for this size range. Larger matrices automatically fall through to cuBLAS unchanged.

Verified performance (NVIDIA RTX 3060, C++ path via load_inline):

Matrix Size	Speedup vs cuBLAS
64x64	~1.4x
128x128	~1.3x
192x192	~1.4x
256x256+	cuBLAS (automatic fallback)

On edge GPUs (Jetson Nano/Orin) where cuBLAS dispatch overhead is proportionally larger, speedups of 2-4x are expected at these sizes.

Installation

pip install levi-edge

Requires: PyTorch >= 2.1, CUDA GPU, CUDA toolkit (for kernel compilation).

Usage

Global Patch (recommended)

import torch
import levi_edge

levi_edge.patch()

# Everything works as before — just faster for small matrices
A = torch.randn(64, 128, device="cuda")
B = torch.randn(128, 64, device="cuda")
C = torch.mm(A, B)  # Uses LEVI kernel

# Large matrices still use cuBLAS
C_big = torch.mm(torch.randn(1024, 1024, device="cuda"),
                 torch.randn(1024, 1024, device="cuda"))

levi_edge.unpatch()  # Restore original behavior
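If the patch should only apply to part of a program, the patch()/unpatch() pair can be wrapped in a context manager. This is a minimal sketch, not part of the levi_edge API: the helper name levi_patched is hypothetical, and it relies only on the patch() and unpatch() functions shown above.

```python
from contextlib import contextmanager


@contextmanager
def levi_patched(lib):
    """Call lib.patch() on entry and lib.unpatch() on exit, even if an error occurs.

    `lib` is expected to be the levi_edge module (or any object that exposes
    patch() and unpatch()). Hypothetical helper, not shipped with levi_edge.
    """
    lib.patch()
    try:
        yield lib
    finally:
        lib.unpatch()


# Intended usage (requires levi_edge, PyTorch, and a CUDA GPU):
#
#     import torch
#     import levi_edge
#
#     with levi_patched(levi_edge):
#         C = torch.mm(A, B)   # patched inside the block
#     C = torch.mm(A, B)       # original behavior restored here
```

The try/finally ensures the original behavior is restored even when the code inside the block raises.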

Direct Call

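import torch
import levi_edge

A, B = torch.randn(64, 64, device="cuda"), torch.randn(64, 64, device="cuda")
C = levi_edge.mm(A, B)      # Always uses LEVI for eligible tensors

A3, B3 = torch.randn(8, 64, 64, device="cuda"), torch.randn(8, 64, 64, device="cuda")
C3 = levi_edge.bmm(A3, B3)  # Batched version (3D inputs)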

Benchmark

from levi_edge.benchmark import benchmark_mm

results = benchmark_mm()  # Tests all edge-relevant sizes

Or from command line:

python -m levi_edge.benchmark
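For a quick manual check on your own shapes, a small timing helper along the following lines can be used. This is a sketch under stated assumptions, not the implementation behind levi_edge.benchmark; the name time_call and its parameters are hypothetical.

```python
import statistics
import time


def time_call(fn, sync=None, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in milliseconds.

    sync is an optional zero-argument callable run before each clock read.
    For CUDA work, pass torch.cuda.synchronize so that asynchronously
    launched kernels have finished before the timer is read.
    """
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# Intended usage (requires PyTorch and a CUDA GPU):
#
#     ms = time_call(lambda: torch.mm(A, B), sync=torch.cuda.synchronize)
```

Timing the same call before and after levi_edge.patch() gives a rough speedup estimate for a specific shape; the median is used to reduce the influence of outliers.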

How It Works

LEVI Edge uses PyTorch's torch.library API to intercept aten::mm and aten::bmm at the CUDA dispatch level. When a matrix multiplication is called:

1. Check dimensions: if M, N, K are all <= 256 and dtype is float32 → use a LEVI kernel
2. Select kernel: one of two specialized CUDA kernels:
   - Simple kernel (M*N <= 16384): cache-friendly with 4x loop unrolling, no shared memory overhead
   - Tiled kernel (16384 < M*N <= 65536): 16x16 shared memory tiles for better bandwidth
3. Fall back: anything larger, or any non-float32 input → cuBLAS handles it (no added cost beyond the eligibility check)
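The decision logic above can be sketched in plain Python. This only mirrors the thresholds stated in this section; the function name, constant names, and return labels are hypothetical, and the real dispatch code may differ.

```python
# Thresholds as stated in the steps above.
MAX_DIM = 256             # M, N, K must each be <= 256
SIMPLE_MAX_ELEMS = 16384  # simple kernel when M*N <= 16384
TILED_MAX_ELEMS = 65536   # tiled kernel when 16384 < M*N <= 65536


def select_path(m, n, k, dtype="float32", device="cuda"):
    """Return which path handles an (m x k) @ (k x n) product.

    Hypothetical illustration: returns "simple", "tiled", or "fallback".
    """
    # Only float32 tensors on a CUDA device are eligible.
    if device != "cuda" or dtype != "float32":
        return "fallback"
    # Any dimension above 256 falls through to the default path.
    if max(m, n, k) > MAX_DIM:
        return "fallback"
    if m * n <= SIMPLE_MAX_ELEMS:
        return "simple"
    # With m, n <= 256, m*n cannot exceed 65536, so this branch always holds here.
    if m * n <= TILED_MAX_ELEMS:
        return "tiled"
    return "fallback"


if __name__ == "__main__":
    for shape in [(64, 64, 64), (192, 192, 192), (1024, 1024, 1024)]:
        print(shape, select_path(*shape))
```

Note that the kernel choice depends on the output size M*N, while eligibility depends on all three of M, N, and K.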

Autograd works transparently — no special handling needed.

When Is This Useful?

Edge AI inference (Jetson Nano/Orin, mobile GPUs): Small MLPs, classifiers
Transformer attention heads: Head dimensions typically 32-128
Sensor fusion: Multiple small matrix operations in real-time
Robotics: Low-latency inference on embedded GPUs
Batch-1 inference: Single-sample inference where cuBLAS overhead dominates

When Is This NOT Useful?

Large batch training (batch sizes >> 256)
Large model inference (GPT-class models)
CPU-only deployment
Non-float32 operations (fp16/bf16 — future support planned)

Eligible Operations

Condition	Required
Device	CUDA
Dtype	float32
Max dimension	256 (M, N, K each)
Operations	`torch.mm`, `torch.bmm`, `torch.matmul` (2D)

Examples

See examples/:

basic_usage.py — Patch/unpatch, direct API
edge_inference.py — MobileNetV2 + MLP on edge
transformer_attention.py — Small transformer heads
jetson_demo.py — Real-time inference loop with latency tracking
benchmark_all.py — Full benchmark suite

Theory

The kernel selection thresholds were determined using susceptibility analysis (sigma_c) — measuring execution time stability across matrix sizes to find the critical scale where cuBLAS dispatch overhead transitions from dominant to negligible. See sigmacore for the general framework.

Based on: M.C. Wurm, "Batch-Size Susceptibility across Five Computational Domains" (2024/2025).

License

Dual-licensed under:

AGPL-3.0 for open-source / non-commercial use (license_AGPL.txt)
Commercial license for proprietary integration (license_COMMERCIAL.txt)

For commercial licensing, contact: nfo@forgottenforge.xyz

Related

batch-susceptibility: Optimal batch size finder for ML training
sigmacore: The general sigma_c framework
levi-gpu: The original LEVI library (CuPy-based)

Version 0.1.0, released Feb 2, 2026.
