GPU-accelerated Batched Orthogonal Matching Pursuit — up to 310x faster than scikit-learn

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Ariel4

These details have not been verified by PyPI

Project links

Paper

Project description

Batched OMP: Fast Orthogonal Matching Pursuit for CPU and GPU

Paper: Efficient Batched CPU/GPU Implementation of Orthogonal Matching Pursuit for Python

Batched implementation of Orthogonal Matching Pursuit (OMP) using BLAS (CPU) and PyTorch (GPU). The fastest GPU implementation of OMP — up to 26x faster than SPAMS (C++) while being pure Python/PyTorch. Up to 310x faster than scikit-learn on GPU.

Benchmark plot

Speedup vs scikit-learn

Config	Best CPU	GPU	SPAMS (C++)	GPU vs SPAMS
Image patches (256×1024, S=32, B=5K)	5.5x	310x	12.0x	25.8x faster
Face recognition (8064×1207, S=30, B=1.2K)	4.9x†	71x	4.9x	14.6x faster
Audio (512×2048, S=64, B=5K)	4.5x	137x	11.4x	12.0x faster

† v0_blas (inverse Cholesky + Cython BLAS) wins on CPU for face recognition due to large n_features (8064).

Hardware: Intel Xeon 8559C, NVIDIA RTX PRO 6000 Blackwell 102 GB (AWS g7e.8xlarge)

There is no other production-ready GPU implementation of OMP. Existing alternatives either crash on overcomplete dictionaries (cr-sparse) or are CPU-only (sklearn, SPAMS). Batched OMP is the fastest OMP implementation available when you have a GPU.

When to use batched-omp

Have a GPU? Use batched-omp — there is nothing faster. 12-26x faster than SPAMS (C++) across all realistic configs
CPU only, want a sklearn drop-in? 4-5x faster, same API, no C dependencies
CPU only, maximum speed? SPAMS is faster (C++ with OpenMP) but harder to install and requires Python ≤3.11 (depends on deprecated numpy.distutils)
Few signals or small problems? sklearn is fine — batching helps most with hundreds+ of signals

Installation

pip install batched-omp

Requires Python 3.10+ and a C compiler (for Cython BLAS extensions). Installs PyTorch automatically.

For GPU support, install PyTorch with CUDA before installing batched-omp, or it will default to CPU-only PyTorch.

For development:

pip install -e ".[dev]"

Quick Start

from batched_omp import run_omp
import torch

# Dictionary X: (n_features, n_components), signals y: (n_samples, n_features)
X = torch.randn(256, 1024, dtype=torch.float64)
X /= X.norm(dim=0)  # normalize columns
y = torch.randn(5000, 256, dtype=torch.float64)

# Solve: find 32-sparse representations of y in dictionary X
coefs = run_omp(X, y, n_nonzero_coefs=32, normalize=False, fit_intercept=False)
# coefs: (5000, 1024) — sparse coefficient matrix

GPU — just move tensors to CUDA:

coefs_gpu = run_omp(X.cuda(), y.cuda(), n_nonzero_coefs=32,
                    normalize=False, fit_intercept=False)

NumPy arrays work too — they're converted to tensors internally:

coefs = run_omp(X_numpy, y_numpy, n_nonzero_coefs=32)

Drop-in sklearn Replacement

Swap one import and get automatic GPU acceleration — works with Pipeline, GridSearchCV, cross_val_score:

# Before (sklearn, CPU only):
from sklearn.linear_model import OrthogonalMatchingPursuit
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10)

# After (batched-omp, automatic GPU):
from batched_omp import BatchedOrthogonalMatchingPursuit
omp = BatchedOrthogonalMatchingPursuit(n_nonzero_coefs=10)

# Same API
omp.fit(X_train, y_train)
predictions = omp.predict(X_test)
coef = omp.coef_          # (n_features,) or (n_targets, n_features)
intercept = omp.intercept_ # float or (n_targets,)

Extra parameters beyond sklearn's API: device ("auto", "cpu", "cuda"), algorithm ("v0", "v0_blas"), normalize (default True).

API

run_omp(X, y, n_nonzero_coefs, precompute=True, tol=0.0,
        normalize=True, fit_intercept=True, alg='v0')

Parameter	Description
`X`	Dictionary matrix `(n_features, n_components)`. Tensor or ndarray.
`y`	Signal matrix `(n_samples, n_features)`. Tensor or ndarray.
`n_nonzero_coefs`	Maximum number of non-zero coefficients per sample.
`tol`	Residual norm threshold for early stopping. `0` disables.
`normalize`	Column-normalize `X` before solving (undo on output).
`fit_intercept`	Center `X` and `y` before solving.
`alg`	`'v0'` (default, inverse Cholesky — fastest) or `'v0_blas'` (CPU-only, Cython BLAS).

Returns a (n_samples, n_components) tensor of sparse coefficients.

Algorithms

All algorithms are batched — they solve B sparse coding problems in parallel using matrix operations instead of looping over samples.

Algorithm	Description	Best for
v0 (inverse Cholesky)	Updates the Cholesky inverse iteratively, avoiding a linear solve each iteration. Based on Zhu et al. 2020.	GPU (any size), CPU (small-medium n_features)
v0_blas	Cython BLAS variant of v0. Calls `dgemv`/`daxpy`/`idamax` directly via scipy.	CPU with large n_features (e.g. 8064)

Project Structure

src/batched_omp/           — the library (torch + numpy + scipy, no sklearn)
    omp.py                 — run_omp, omp_naive, omp_v0, omp_v0_blas
    utils.py               — batch_mm, innerp, cholesky_solve, elapsed_timer
    blas_kernels/           — Cython BLAS wrappers (daxpy, dgemv, dppsv, idamax)
benchmarks/
    benchmarks.py          — benchmarking harness with sklearn baseline
    plot_results.py        — generates the 3-panel figure above
    results/               — benchmark outputs and plots

Running Benchmarks

# Realistic configs (image patches, face recognition, audio)
python benchmarks/benchmarks.py all

# Parameter sweep: N x B x S heatmaps (takes ~20 min)
python benchmarks/benchmarks.py sweep
python benchmarks/plot_sweep.py

# Ablation study: isolates contribution of batching, Gram precomputation, inverse Cholesky
python benchmarks/benchmarks.py ablation
python benchmarks/plot_ablation.py

# CPU only
python benchmarks/benchmarks.py image_patches --no-gpu

Ablation results show that batching is the dominant optimization — processing all signals simultaneously provides 4-750x speedup over single-sample loops. Gram precomputation and inverse Cholesky contribute 1-2x each.

Citation

@article{lubonja2024efficient,
  title={Efficient batched CPU/GPU implementation of orthogonal matching pursuit for Python},
  author={Lubonja, Ariel and Pr{\"a}sius, Sebastian Kazmarek and Tran, Trac Duy},
  journal={arXiv preprint arXiv:2407.06434},
  year={2024}
}

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Ariel4

These details have not been verified by PyPI

Project links

Paper

Release history Release notifications | RSS feed

This version

0.1.2

Mar 29, 2026

0.1.0

Mar 29, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

batched_omp-0.1.2.tar.gz (199.0 kB view details)

Uploaded Mar 29, 2026 Source

File details

Details for the file batched_omp-0.1.2.tar.gz.

File metadata

Download URL: batched_omp-0.1.2.tar.gz
Upload date: Mar 29, 2026
Size: 199.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for batched_omp-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`3fa6aa77c5e7997593ae1a9451218768c6fa8442d2503dd7b187f77d76480666`
MD5	`4bfc79a09a4539988743109848049263`
BLAKE2b-256	`82fd341732e2e576a4937546e0985610d67c482cc4b6822d222db29a4ae5b53c`

See more details on using hashes here.

Provenance

The following attestation bundles were made for batched_omp-0.1.2.tar.gz:

Publisher: publish.yml on ariellubonja/orthogonal-matching-pursuit-gpu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: batched_omp-0.1.2.tar.gz
- Subject digest: 3fa6aa77c5e7997593ae1a9451218768c6fa8442d2503dd7b187f77d76480666
- Sigstore transparency entry: 1195198533
- Sigstore integration time: Mar 29, 2026
Source repository:
- Permalink: ariellubonja/orthogonal-matching-pursuit-gpu@e28895f01da6236e596e9fbf1928c18c5a259c26
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/ariellubonja
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e28895f01da6236e596e9fbf1928c18c5a259c26
- Trigger Event: release

batched-omp 0.1.2

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Batched OMP: Fast Orthogonal Matching Pursuit for CPU and GPU

Speedup vs scikit-learn

When to use batched-omp

Installation

Quick Start

Drop-in sklearn Replacement

API

Algorithms

Project Structure

Running Benchmarks

Citation

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes

Provenance