High-performance ML primitives, applications, and informative cost API — Triton + CuteDSL kernels for NVIDIA GPUs.

FlashLib

A GPU library for classical machine-learning operators — kmeans, knn, ivf-flat, pca, svd, dbscan, hdbscan, umap, t-sne, regression, GEMM, and more — built on Triton and CuteDSL.

See the blog post for motivation, design, and benchmarks.

Installation

Install with pip:

pip install flashlib

From source:

git clone https://github.com/FlashML-org/flashlib.git
cd flashlib
pip install -e .

Usage

import torch
from flashlib import flash_kmeans

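# 1M points in 128 dimensions, resident on the GPU
x = torch.randn(1_000_000, 128, device="cuda", dtype=torch.float32)
# Cluster into 1024 groups, running at most 20 iterations
labels, centroids, n_iter = flash_kmeans(x, n_clusters=1024, max_iters=20)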

Every primitive is exposed as a top-level flash_* function and as a sklearn-style class (KMeans, PCA, HDBSCAN, …).
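
To make the (labels, centroids, n_iter) triple returned by the flash_kmeans call above concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the standard k-means iteration. It is a CPU reference for intuition only and says nothing about how FlashLib's kernels work. The name lloyd_kmeans, the first-k initialisation, and the meaning of n_iter (passes run, including the final pass that detects convergence) are assumptions made for this illustration.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, n_clusters, max_iters):
    """Reference Lloyd's k-means; returns (labels, centroids, n_iter)."""
    # Assumption: seed the centroids with the first n_clusters points.
    centroids = [list(p) for p in points[:n_clusters]]
    labels = [-1] * len(points)
    n_iter = 0
    for n_iter in range(1, max_iters + 1):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: sq_l2(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, n_iter


# Two well-separated blobs, interleaved so the first two points seed one each.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11)]
labels, centroids, n_iter = lloyd_kmeans(points, n_clusters=2, max_iters=20)
print(labels, centroids, n_iter)
```

The first-k seeding keeps the sketch deterministic; practical implementations usually seed randomly or with k-means++, because Lloyd's iteration only converges to a local optimum that depends on the starting centroids.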

Index-based primitives like IVF-Flat (GPU approximate nearest neighbours) build an index once and query it many times:

import torch
from flashlib import IVFFlat

db = torch.randn(1_000_000, 128, device="cuda")
queries = torch.randn(10_000, 128, device="cuda")

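# Build once: partition db into nlist inverted lists
index = IVFFlat(nlist=1024, nprobe=16).fit(db)
# Query many times: each query scans only its nprobe nearest lists
distances, indices = index.kneighbors(queries, n_neighbors=10)  # distances are squared L2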

nprobe is the recall knob: it sets how many of the nlist inverted lists each query scans. At a fixed (nlist, nprobe), the probed candidate set, and thus recall, matches a reference IVF-Flat implementation (FAISS / cuVS), so raising nprobe trades speed for recall without changing the kernel.
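
The effect of nprobe can be sketched with a toy IVF-Flat in pure Python. This is a generic illustration of the technique, not FlashLib's implementation: the helper names (build_ivf, search_ivf, recall_at_k) are hypothetical, and centroids are simply sampled from the database, whereas real indexes typically train them with k-means.

```python
import random


def sq_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(db, nlist, rng):
    """Assign every database vector to the inverted list of its nearest centroid."""
    centroids = rng.sample(db, nlist)  # assumption: sampled, not k-means-trained
    lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(db):
        nearest = min(range(nlist), key=lambda c: sq_l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists


def search_ivf(db, centroids, lists, query, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    scored = sorted((sq_l2(query, db[i]), i) for i in candidates)
    return [i for _, i in scored[:k]]


def brute_force(db, query, k):
    """Exact k nearest neighbours, used as ground truth."""
    scored = sorted((sq_l2(query, v), i) for i, v in enumerate(db))
    return [i for _, i in scored[:k]]


def recall_at_k(db, queries, centroids, lists, k, nprobe):
    """Fraction of true neighbours that the IVF search returns."""
    hits = 0
    for q in queries:
        truth = set(brute_force(db, q, k))
        hits += len(truth & set(search_ivf(db, centroids, lists, q, k, nprobe)))
    return hits / (k * len(queries))


rng = random.Random(0)
db = [[rng.random() for _ in range(4)] for _ in range(200)]
queries = [[rng.random() for _ in range(4)] for _ in range(20)]
centroids, lists = build_ivf(db, nlist=8, rng=rng)

recalls = {p: recall_at_k(db, queries, centroids, lists, k=5, nprobe=p)
           for p in (1, 2, 4, 8)}
for p, r in recalls.items():
    print(f"nprobe={p}: recall@5={r:.2f}")
```

Because the candidate set for a larger nprobe always contains the candidate set for a smaller one, recall can only stay flat or rise as nprobe grows, and with nprobe equal to nlist the search degenerates to an exact scan.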

Informative API

The flashlib.info submodule predicts runtime, FLOPs, and HBM bytes for any primitive in ~5 µs on pure CPU — useful for budgeting a pipeline before launching it, and small enough for an LLM agent to call in a GPU-less environment. It does not import torch, triton, or cutlass.

import flashlib.info as info

est = info.estimate("kmeans",
                    shape=(100_000, 64),
                    params={"K": 256, "max_iters": 20},
                    device="H200")
print(est.summary_line())

See the blog post for the full API, the tolerance-driven dispatch, and per-primitive benchmarks.
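
One common way to produce this kind of estimate without touching a GPU is a roofline model: count FLOPs and HBM bytes analytically, then take the slower of the compute-limited and bandwidth-limited times. The sketch below illustrates that idea for k-means only. It is not the model flashlib.info uses; every name in it (Device, Estimate, estimate_kmeans, TOY_GPU) is hypothetical, and the device numbers are placeholders rather than specifications of any real GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float     # FLOP/s (placeholder, not a vendor spec)
    hbm_bandwidth: float  # bytes/s (placeholder, not a vendor spec)


@dataclass(frozen=True)
class Estimate:
    flops: float
    hbm_bytes: float
    runtime_s: float
    bound: str  # "compute" or "memory"


def estimate_kmeans(n, d, k, iters, device, bytes_per_elem=4):
    """Roofline-style cost estimate for k-means on n points of dimension d."""
    # Assumption: the (n x d) by (d x k) distance GEMM dominates each iteration;
    # the cheaper centroid-update arithmetic is ignored.
    flops = iters * 2.0 * n * d * k
    # Assumption: per iteration, read the points, read and write the
    # centroids, and write one 4-byte label per point.
    hbm_bytes = iters * bytes_per_elem * (n * d + 2 * k * d + n)
    compute_s = flops / device.peak_flops
    memory_s = hbm_bytes / device.hbm_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return Estimate(flops, hbm_bytes, max(compute_s, memory_s), bound)


TOY_GPU = Device("toy-gpu", peak_flops=50e12, hbm_bandwidth=4e12)
est = estimate_kmeans(n=100_000, d=64, k=256, iters=20, device=TOY_GPU)
print(f"{est.flops:.3e} FLOPs, {est.hbm_bytes:.3e} B, "
      f"{est.runtime_s * 1e3:.3f} ms ({est.bound}-bound)")
```

A model of this shape is just a few multiplications, which is why such an estimator can plausibly run in microseconds and needs no GPU libraries.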

Coverage

The current release ships 16 high-level primitives across the following families:

| Family            | Primitives                                                            |
|-------------------|-----------------------------------------------------------------------|
| Clustering        | flash_kmeans, flash_dbscan, flash_hdbscan, flash_spectral_clustering  |
| Nearest neighbors | flash_knn, flash_ivf_flat (IVF-Flat ANN)                              |
| Decomposition     | flash_pca, flash_truncated_svd                                        |
| Manifold          | flash_umap, flash_tsne                                                |
| Regression        | flash_linear_regression, flash_ridge, flash_logistic_regression       |
| Classification    | flash_multinomial_nb, flash_random_forest                             |
| Preprocessing     | flash_standard_scaler                                                 |

Plus low-level linear-algebra primitives (cov_gemm, gram_gemm, ab_gemm, eigh, polar, msign, cholqr2, split_basis) and a Pareto-frontier set of multi-precision GEMM variants (gemm, gemm_tf32, gemm_3xtf32, gemm_bf16, gemm_fp16, gemm_fp16_x9, gemm_fp16_x3_kahan, gemm_ozaki2_int8, …).
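
A name like gemm_3xtf32 suggests the well-known 3xTF32 trick: split each FP32 operand into a TF32 "big" part plus a TF32 "small" residual, and recover most of the lost precision with three low-precision products. Whether FlashLib's variant follows exactly this scheme is an assumption; the pure-Python sketch below only emulates the arithmetic to show why the split helps. TF32 is emulated by truncating the low 13 mantissa bits of a float32 (hardware may round instead), and accumulation is done in float64 for simplicity.

```python
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x):
    """Emulate TF32 by zeroing the low 13 mantissa bits (truncation assumed)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def split(x):
    """Split a float32 into a TF32 big part and a TF32 small residual."""
    big = to_tf32(x)
    return big, to_tf32(x - big)


def mul_exact(x, y):
    return x * y


def mul_1xtf32(x, y):
    return to_tf32(x) * to_tf32(y)


def mul_3xtf32(x, y):
    xb, xs = split(x)
    yb, ys = split(y)
    # Three TF32 products; the tiny small*small term is dropped.
    return xs * yb + xb * ys + xb * yb


def matmul(a, b, mul):
    """Naive matmul with a pluggable scalar product."""
    return [[sum(mul(a[i][k], b[k][j]) for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_abs_err(c, ref):
    return max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))


a = [[to_f32(1 / 3), to_f32(1 / 7)], [to_f32(1 / 11), to_f32(1 / 13)]]
b = [[to_f32(1 / 17), to_f32(1 / 19)], [to_f32(1 / 23), to_f32(1 / 29)]]

ref = matmul(a, b, mul_exact)
err_1x = max_abs_err(matmul(a, b, mul_1xtf32), ref)
err_3x = max_abs_err(matmul(a, b, mul_3xtf32), ref)
print(f"1xTF32 max error: {err_1x:.3e}")
print(f"3xTF32 max error: {err_3x:.3e}")
```

The split costs three multiplications instead of one, which is the kind of speed-versus-accuracy trade-off a Pareto-frontier set of GEMM variants lets the caller choose between.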

Citation

@misc{yang2026flashlib,
  title  = {FlashLib: Bringing Flash Magic to Classical Machine Learning Operators},
  author = {Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Mang, Qiuyang and
            Wang, Zhe and Sun, Shanlin and Keutzer, Kurt and Gonzalez, Joseph E. and
            Han, Song and Xu, Chenfeng and Stoica, Ion},
  year   = {2026},
  url    = {https://flashml-org.github.io/},
}

License

Apache License 2.0.


