Optimized CUDA implementation of differentiable SSIM for PyTorch

fussim

Fast CUDA SSIM for PyTorch. Based on MrNeRF/optimized-fused-ssim.

~6.4x faster than pytorch-msssim (RTX 4090 benchmark) | FP16/AMP support | Drop-in replacement

Installation

pip install fussim

Pre-built wheels for PyTorch 2.5-2.9 and CUDA 11.8-12.8:

pip install fussim --extra-index-url https://opsiclear.github.io/fussim/whl/

Build from source

Requires the CUDA Toolkit and a C++ compiler.

git clone https://github.com/OpsiClear/fussim.git
cd fussim
pip install .

Quick Start

import torch
from fussim import fused_ssim

img1 = torch.rand(1, 3, 256, 256, device="cuda", requires_grad=True)
img2 = torch.rand(1, 3, 256, 256, device="cuda")

# Compute SSIM
ssim_value = fused_ssim(img1, img2)

# Use as loss
loss = 1.0 - fused_ssim(img1, img2)
loss.backward()

API

fused_ssim

from fussim import fused_ssim

fused_ssim(img1, img2, padding="same", train=True, window_size=11) -> Tensor

Parameter    Type    Default   Description
img1         Tensor  required  First image (B, C, H, W). Receives gradients.
img2         Tensor  required  Second image (B, C, H, W).
padding      str     "same"    "same" or "valid"
train        bool    True      Enable gradient computation
window_size  int     11        Gaussian window size: 7, 9, or 11

Returns: Scalar mean SSIM value.
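
The padding option determines how large the SSIM map is before it is averaged. The following is a minimal sketch, assuming "valid" follows the usual convolution convention (no padding, so each spatial dimension shrinks by window_size - 1) and "same" keeps the input size. ssim_map_shape is a hypothetical helper for illustration and is not part of fussim.

```python
def ssim_map_shape(height, width, window_size=11, padding="same"):
    """Spatial size of the SSIM map under standard convolution conventions.

    Hypothetical helper for illustration; not part of the fussim API.
    """
    if window_size not in (7, 9, 11):
        raise ValueError("window_size must be 7, 9, or 11")
    if padding == "same":
        # Assumed: the input is padded so the map matches the image size.
        return height, width
    if padding == "valid":
        # Assumed: no padding, so only fully covered positions remain.
        if height < window_size or width < window_size:
            raise ValueError("image is smaller than the window")
        return height - window_size + 1, width - window_size + 1
    raise ValueError('padding must be "same" or "valid"')


print(ssim_map_shape(256, 256, padding="same"))   # (256, 256)
print(ssim_map_shape(256, 256, padding="valid"))  # (246, 246)
```

Under these assumptions, "valid" averages over fewer positions and ignores border effects, so the two settings generally return slightly different scalar values for the same pair of images.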

ssim (pytorch-msssim compatible)

from fussim import ssim

ssim(X, Y, data_range=255, size_average=True, win_size=11, K=(0.01, 0.03), nonnegative_ssim=False) -> Tensor

Parameter         Type    Default       Description
X                 Tensor  required      First image (B, C, H, W). Receives gradients.
Y                 Tensor  required      Second image (B, C, H, W).
data_range        float   255           Value range (255 for uint8, 1.0 for normalized)
size_average      bool    True          Return the scalar mean (True) or per-batch values (False)
win_size          int     11            Gaussian window size: 7, 9, or 11
K                 tuple   (0.01, 0.03)  SSIM constants (K1, K2)
nonnegative_ssim  bool    False         Clamp negative values to 0
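
data_range matters because the SSIM stabilizing constants scale with it. In the standard SSIM definition, C1 = (K1 * L)^2 and C2 = (K2 * L)^2, where L is the data range. The sketch below shows that arithmetic, assuming fussim follows the standard definition. ssim_constants is a hypothetical helper, not a fussim function.

```python
def ssim_constants(data_range=255.0, K=(0.01, 0.03)):
    """Standard SSIM stabilizing constants (C1, C2) for a given data range.

    Hypothetical helper; assumes the textbook SSIM definition.
    """
    k1, k2 = K
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    return c1, c2


# Compare the constants for uint8-style and normalized inputs.
for data_range in (255.0, 1.0):
    c1, c2 = ssim_constants(data_range)
    print(f"data_range={data_range}: C1={c1:.6g}, C2={c2:.6g}")
```

If the standard definition applies, passing images in [0, 1] while leaving data_range at its default of 255 makes the constants far larger than the image statistics, so data_range=1.0 is the safer choice for normalized inputs.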

SSIM Module

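import torch
from fussim import SSIM

# Example inputs in [0, 1], matching data_range=1.0
pred = torch.rand(1, 3, 256, 256, device="cuda", requires_grad=True)
target = torch.rand(1, 3, 256, 256, device="cuda")

module = SSIM(data_range=1.0, size_average=True, win_size=11, K=(0.01, 0.03))
loss = 1 - module(pred, target)
loss.backward()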

FP16 / AMP

with torch.autocast(device_type="cuda"):
    ssim_value = fused_ssim(img1, img2)  # Uses FP16 kernel automatically

Performance

RTX 4090, input shape 5×5×1080×1920 (B×C×H×W), 100 iterations:

Implementation  Forward  Backward  Total    Speedup
pytorch_msssim  28.7 ms  28.9 ms   57.5 ms  1.0x
fussim          4.38 ms  4.66 ms   9.04 ms  6.4x
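
Timings like these depend on how they are measured: CUDA kernels launch asynchronously, so the device has to be synchronized before the clock is read. The following is a minimal harness sketch, not the benchmark script behind the table above. time_ms and speedup are hypothetical helpers, and the sync argument is where torch.cuda.synchronize would be passed on a GPU.

```python
import time


def time_ms(fn, iters=100, warmup=10, sync=None):
    """Average wall-clock time of fn() in milliseconds.

    sync is an optional callable that blocks until queued work is done
    (for example torch.cuda.synchronize when timing GPU code).
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warmup work has finished before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for all timed work to finish
    return (time.perf_counter() - start) * 1000.0 / iters


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Reported totals from the table: 57.5 ms vs 9.04 ms
print(f"{speedup(57.5, 9.04):.1f}x")  # 6.4x
```

On a GPU, a forward-pass measurement might look like time_ms(lambda: fused_ssim(img1, img2), sync=torch.cuda.synchronize).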

Limitations

Parameter  Constraint     Reason
win_size   7, 9, or 11    CUDA kernel templates
win_sigma  1.5 (fixed)    Hardcoded in kernel
win        Not supported  Uses built-in Gaussian
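
For reference, the conventional SSIM window is a normalized Gaussian. The sketch below builds the 1D weights for the supported sizes with sigma fixed at 1.5. It assumes the kernel uses this standard form, and gaussian_window is a hypothetical helper rather than a fussim function.

```python
import math

SUPPORTED_WIN_SIZES = (7, 9, 11)
WIN_SIGMA = 1.5  # fixed, per the limitations table


def gaussian_window(win_size=11, sigma=WIN_SIGMA):
    """Normalized 1D Gaussian weights centered on the middle tap.

    Hypothetical helper; assumes the standard SSIM Gaussian window.
    """
    if win_size not in SUPPORTED_WIN_SIZES:
        raise ValueError(f"win_size must be one of {SUPPORTED_WIN_SIZES}")
    center = win_size // 2
    weights = [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
        for i in range(win_size)
    ]
    total = sum(weights)
    return [w / total for w in weights]  # weights sum to 1


for size in SUPPORTED_WIN_SIZES:
    window = gaussian_window(size)
    print(size, [round(w, 4) for w in window])
```

The corresponding 2D window is the outer product of these 1D weights with themselves.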

Attribution

Project               Author
optimized-fused-ssim  Janusch Patas
fused-ssim            Rahul Goel

Citation

@software{optimized-fused-ssim,
    author = {Janusch Patas},
    title = {Optimized Fused-SSIM},
    year = {2025},
    url = {https://github.com/MrNeRF/optimized-fused-ssim},
}
