Skip to main content

FP8 per-tile scaled linear solver for consumer NVIDIA GPUs

Project description

ssBlast

First open-source FP8 linear solver for consumer NVIDIA GPUs

Solves Ax = b using FP8 precision with per-tile scaling, delivering FP64-accurate results in 2-3× faster time than CuPy FP64. Works on any NVIDIA GPU from GTX 1060 to RTX 4090.

Why ssBlast

Tool FP8 Consumer GPU Open Source Speed
cuSOLVER Limited 1x
MAGMA Limited 1x
ssBlast 2-3x

Install

# Core (CPU fallback available)
pip install ssblast

# With FP8 Triton kernel (Linux/WSL2 + NVIDIA GPU)
pip install "ssblast[triton]"

Usage

from ssblast import solve
import cupy as cp

A = cp.random.randn(4000, 4000)
b = cp.random.randn(4000)

x = solve(A, b)  # FP64-accurate result in 0.19s
# vs CuPy FP64:   0.54s (2.9x slower)
# vs SciPy CPU:   0.71s (3.8x slower)

Benchmark — RTX 4050 Laptop GPU (WSL2, Triton 3.6.0)

Matrix SciPy CPU CuPy FP64 ssBlast FP8 Speedup
1000×1000 0.025s 0.026s 0.020s 1.3×
2000×2000 0.128s 0.121s 0.050s 2.4×
3000×3000 0.357s 0.293s 0.103s 2.8×
4000×4000 0.713s 0.542s 0.188s 2.9×
8000×8000 4.041s 2.066s 1.021s 2.0×
10000×10000 6.701s 4.026s 1.920s 2.1×

Performance characteristics:

  • Peak speedup ~3× at n=3000-4000 for RTX 40-series GPUs
  • Designed for large systems (n >= 2000)
  • All results FP64-accurate (max error < 1e-11)
  • Graceful fallback chain: FP8 → FP16 → FP32 → FP64 → CPU

How It Works

solve(A, b)
    ↓
Layer 1: Detect GPU (RTX 4050 → tier FP8)
    ↓
Layer 2: Select precision plan (FP8 + per-tile scaling)
    ↓
Layer 3: Dispatch to correct compute path
    ↓
Layer 4: FP8 Triton GEMM kernel
         Each 32×32 tile computes own scale factor
         Keeps values inside FP8 range ±447
         tl.dot automatically uses Tensor Cores
    ↓
Layer 5: Iterative refinement (GPU LU cached)
         Corrects FP8 rough solve → FP64 accuracy
    ↓
Output: FP64 correct solution x

GPU Support

GPU Tier Path Status
RTX 40xx FP8 Triton kernel ✅ Optimized
RTX 30xx FP16 CuPy cuBLAS ✅ Working
RTX 20xx FP16 CuPy cuBLAS ✅ Working
GTX 10xx FP32 CuPy cuBLAS ✅ Working
CPU only SciPy fallback ✅ Available

Novel Contribution

Per-tile FP8 scaling in ssblast/kernels/ssblast_kernel.py (~80 lines)

Each 32×32 tile independently computes its own scale factor. This means:

  • No global clipping (which loses precision)
  • Every FP8 region uses the full 0-255 value space
  • Computed in-kernel (no CPU overhead)

No equivalent exists in:

  • cuSOLVER (proprietary, no FP8)
  • MAGMA (no FP8 solver)
  • SLATE (CPU-focused)
  • Any open-source GPU solver

Test Results

pytest tests/   # 43/43 passing

Test coverage:

  • Unit tests: 33/33 pass (layers 0-5)
  • Final checks: 10/10 pass (production quality)
    • FP8 Triton path verified active
    • Accuracy stable across 10 runs
    • VRAM limit handling
    • Ill-conditioned matrices
    • Error messages clear

Requirements

  • Python ≥ 3.10
  • CUDA 12.x
  • NVIDIA GPU with compute capability ≥ 7.0
  • cupy-cuda12x, scipy, numpy
  • Optional: triton>=3.0.0, torch>=2.0 (for FP8 on Linux/WSL2)

Limitations

  • Linux/WSL2 only for FP8 path (Triton requirement)
  • Windows: falls back to FP16 path (still 2-3× faster than SciPy CPU)
  • Speedup only for n ≥ 2000 (refinement overhead at small n)
  • Input must fit in VRAM (max ~6 GB on consumer GPUs)

References

  • How to Optimize GEMM — FLAME Project, UT Austin. GotoBLAS/BLIS blocking strategy inspired the per-tile design in ssblast_kernel.py.
  • Higham, N.J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM.
  • OpenAI Triton — https://github.com/openai/triton

Author

SHARVESWAR MADASAMY — B.Tech CSE, SRM IST KTR

License

MIT — See LICENSE file

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ssblast-0.1.1.tar.gz (15.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ssblast-0.1.1-py3-none-any.whl (10.4 kB view details)

Uploaded Python 3

File details

Details for the file ssblast-0.1.1.tar.gz.

File metadata

  • Download URL: ssblast-0.1.1.tar.gz
  • Upload date:
  • Size: 15.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for ssblast-0.1.1.tar.gz
Algorithm Hash digest
SHA256 7ff0a8b26e2bdd230d1ba5d96ca304604df47da489b83c18b50e79c382661fee
MD5 e44a190bec2101f66396f493bbd50e08
BLAKE2b-256 d3ba4445dbf7dbad2f1d6faade8406d36d29e33cd5f3f12430850ee2d97bd139

See more details on using hashes here.

File details

Details for the file ssblast-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: ssblast-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 10.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for ssblast-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 96f096962e6fb9837da3f5af42e2f10c5cb37010c246f2898a9197327033a4dc
MD5 578a528968ef221be1bc24c41291032b
BLAKE2b-256 7cda491eda717b82fba90e0bbdf046da5305cd1b72331c52f57156f8f3f612b4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page