RIFFT zero-copy FFT bindings
Project description
RIFFT
RIFFT is a self-contained FFT runtime that combines RustFFT plans, Rayon parallelism, and optional
std::simd acceleration to deliver deterministic 2-D FFTs, fused forward/filter/inverse passes, and
zero-copy bridges into C and Python runtimes.
Quickstart
Rust:
use num_complex::Complex32;
use rifft::RifftHandle;
let handle = RifftHandle::new();
let mut plane = vec![Complex32::default(); 256 * 256];
handle.fft2d_forward(&mut plane, 256, 256)?;
# Ok::<(), rifft::types::RifftError>(())
Python:
import numpy as np
from rifft import fft2
x = (np.random.randn(256, 256) + 1j * np.random.randn(256, 256)).astype(np.complex64)
y = fft2(x)
Torch optional:
import torch
from rifft import fft2
x = torch.randn(256, 256, dtype=torch.complex64)
y = fft2(x)
When to reach for RIFFT
- Deterministic spectral filtering pipelines where you control the kernels and want repeatable CPU-only latency (e.g., classical vision, audio, radar, or scientific imaging).
- Embedding Torch/NumPy workloads into Rust without copying data: transfer ownership through DLPack and let RIFFT mutate in place, then return the capsule to Python.
- Services that batch a small set of FFT shapes (tile-based convolutions, patch-wise inference) and benefit from the built-in planner/workspace cache to skip repeated allocations.
- Interop tooling: export from a C++ or Python stack, run RIFFT’s fused forward→filter→inverse path, and re-import without touching disk or extra serialization layers.
If you primarily need GPU throughput, autograd support, many dtypes, or arbitrary dimension counts, stick to torch.fft / numpy.fft; RIFFT intentionally narrows the scope to keep the CPU kernels lean.
Features
- Planner + workspace cache keyed by geometry/direction/SIMD flag with per-thread scratch.
- Fused kernels (
fft → filter → ifft) with SIMD element-wise multiply helpers. - Zero-copy DLPack bridge for Torch tensors plus an optional C FFI surface.
- PyO3/Maturin bindings shipping the same runtime to Python (
pip install rifft). - Criterion benchmarks and Python harness for reproducible timing across standard sizes.
- Focused surface area: currently limited to complex64 2-D FFTs (single plane or batches). No autograd, dtype promotion, filtering helpers, or half/complex128 kernels—use NumPy/Torch for generic transforms and hand buffers to RIFFT only when you need the tuned path.
┌──────────────┐ ┌───────────────┐ ┌───────────────┐
│ DLPack inputs│ ──▶ │ Planner cache │ ──▶ │ Row FFT stage │
└──────────────┘ └───────────────┘ └───────────────┘
│ │
▼ ▼
┌──────────────┐ ┌───────────────┐ ┌───────────────┐
│ TLS scratch │ ◀── │ Workspace pool│ ◀── │ Column FFT/T │
└──────────────┘ └───────────────┘ └───────────────┘
│ │
▼ ▼
┌──────────────┐ ┌───────────────┐ ┌───────────────┐
│ SIMD fused │ ◀── │ C / Python FFI│ ◀── │ Torch adapters │
└──────────────┘ └───────────────┘ └───────────────┘
Installation
Rust crate
cargo add rifft
Python package
pip install rifft
# or with Torch + helpers
pip install "rifft[torch]"
Benchmarks use the Python environment you run them with; to include torch.fft
comparisons, ensure torch is installed in that interpreter.
Install from source for local development via
./scripts/install_python_binding.sh
# or RUSTUP_TOOLCHAIN=nightly maturin develop --release --features python
Optimised local builds
Prebuilt wheels target a broad set of CPUs, so they deliberately avoid the
-C target-cpu=native flag that unlocks the fastest kernels on your machine.
When installing from source you can mirror the settings used by
scripts/build_and_bench.sh by exporting the relevant environment variables
before invoking pip or maturin. For example:
RUSTUP_TOOLCHAIN=nightly \
RUSTFLAGS="-C target-cpu=native" \
RIFFT_FEATURES=python \
pip install --no-build-isolation --force-reinstall .
or equivalently:
RUSTUP_TOOLCHAIN=nightly \
RUSTFLAGS="-C target-cpu=native" \
./scripts/install_python_binding.sh --locked
Both commands build a wheel that is specialised for the local CPU and includes
the Python helpers. You can also set RIFFT_PREPLAN=auto (or a custom list) at
runtime to warm the planner cache, matching the defaults in
scripts/build_and_bench.sh.
Need a single-command installer? The helper scripts/install_native.sh detects
your architecture, infers a sensible thread count, sets the recommended flags,
and recompiles RIFFT from the main branch:
curl -fsSL https://raw.githubusercontent.com/benschneider/rifft/main/scripts/install_native.sh | bash
Pass any additional pip arguments (for example --user) after the -- separator:
curl -fsSL https://raw.githubusercontent.com/benschneider/rifft/main/scripts/install_native.sh | bash -s -- --user
The script always installs from source via --no-binary rifft, defaults
RIFFT_PREPLAN=auto, and sets RUSTFFT_THREADS to the available logical cores,
so the resulting build matches the settings used in our benchmarks. Override
RIFFT_GIT_REF if you need to pin a specific release.
Performance targets
| Size | Target (ms) | Baseline bench (ms) |
|---|---|---|
| 256×256 | ≤ 0.35 | 0.32 on M1 Max |
| 512×512 | ≤ 1.00 | 0.94 on M1 Max |
| 1024×1024 | ≤ 3.00 | 2.8 on M1 Max |
Benchmark Results
Hardware: M1 Max 32GB RAM
Environment: cpu=arm, threads=10, RUSTFLAGS=-C target-cpu=native
| Shape | Impl | Median (ms) | Mean (ms) | Std (ms) |
|---|---|---|---|---|
| (256, 256) | torch.fft | 0.208 | 0.210 | 0.009 |
| (256, 256) | numpy | 0.769 | 0.789 | 0.064 |
| (256, 256) | rifft.torch | 0.387 | 0.390 | 0.051 |
| (256, 256) | rifft.np | 0.297 | 0.282 | 0.094 |
| (256, 256) | rifft.np_out | 0.182 | 0.199 | 0.085 |
| (512, 512) | torch.fft | 0.976 | 0.990 | 0.058 |
| (512, 512) | numpy | 4.266 | 4.331 | 0.258 |
| (512, 512) | rifft.torch | 0.668 | 0.710 | 0.548 |
| (512, 512) | rifft.np | 0.468 | 0.517 | 0.146 |
| (512, 512) | rifft.np_out | 0.628 | 0.686 | 0.274 |
| (1024, 1024) | torch.fft | 6.904 | 6.894 | 0.781 |
| (1024, 1024) | numpy | n/a | n/a | n/a |
| (1024, 1024) | rifft.torch | 2.090 | 2.153 | 0.487 |
| (1024, 1024) | rifft.np | 1.732 | 1.979 | 1.131 |
| (1024, 1024) | rifft.np_out | 2.360 | 2.782 | 1.431 |
| (1536, 1536) | torch.fft | 16.637 | 16.732 | 0.905 |
| (1536, 1536) | numpy | n/a | n/a | n/a |
| (1536, 1536) | rifft.torch | 3.968 | 4.400 | 1.668 |
| (1536, 1536) | rifft.np | 3.607 | 4.284 | 2.200 |
| (1536, 1536) | rifft.np_out | 4.770 | 5.382 | 1.698 |
| (2048, 2048) | torch.fft | 39.082 | 39.407 | 1.199 |
| (2048, 2048) | numpy | n/a | n/a | n/a |
| (2048, 2048) | rifft.torch | 8.540 | 9.268 | 2.575 |
| (2048, 2048) | rifft.np | 8.401 | 9.315 | 3.188 |
| (2048, 2048) | rifft.np_out | 10.117 | 10.945 | 2.276 |
Criterion benches live under benches/ and report timing, bandwidth, and thread scaling.
Limits & assumptions
- Complex64 2-D only: RIFFT currently handles batches of
(H, W)planes stored row-major. No real-only, half precision, complex128, or >2D transforms. Callers must prepack complex data. - CPU, host memory: targets x86-64 and aarch64 CPUs with AVX2/AVX-512/NEON fast paths. There is no GPU backend. Inputs must be host-resident, contiguous buffers.
- Normalized inverse: every inverse path (plain or fused) scales by
1/(H*W)for consistency. If you need the raw RustFFT output, divide manually before calling into RIFFT. - Caching is bounded: planner caches evict once the small/large FIFO caps are reached (env vars
RUSTFFT_SMALL_CACHE,RUSTFFT_LARGE_CACHE). Filter spectra use an LRU with capacity governed byRIFFT_FILTER_CACHE(default 32). Repeatedly cycling through many shapes/filters will thrash. - Bench tolerances: Torch/NumPy equivalence checks allow up to
5e-4for ≤1M elements and1.25e-3for larger grids to absorb expected CPU rounding drift. Failures outside those limits are treated as correctness bugs. - Contiguous inputs: The Rust core assumes C-contiguous row-major layout; Python helpers canonicalize the layout (with at most one copy) before passing buffers across the FFI boundary.
Rust API
use num_complex::Complex32;
use rifft::RifftHandle;
let mut handle = RifftHandle::new();
let mut plane = vec![Complex32::default(); 512 * 512];
handle.fft2d_forward(&mut plane, 512, 512).unwrap();
// Or keep the result column-major for chaining:
handle
.fft2d_forward_transposed(&mut plane, 512, 512)
.unwrap();
The RifftHandle caches plans (height, width, direction, dtype, SIMD flag) and reuses aligned
workspace allocations for every call. Set RUSTFFT_THREADS to pin Rayon parallelism and
RUSTFFT_SMALL_MAX/RUSTFFT_SMALL_CACHE to tune the small-plan FIFO. On Linux runners with only one
or two logical cores (e.g., GitHub Actions VMs), RIFFT automatically drops to a single worker thread
to avoid fighting torch for bandwidth.
Set RIFFT_PREPLAN=auto (or a comma-separated list like RIFFT_PREPLAN=256x256,1024x512) to warm
up planner entries during RifftHandle::new().
C ABI
The shared library exports riff_create_handle, riff_fft2d_forward, riff_fft2d_inverse,
riff_fft2d_fused_filter, riff_get_version, and riff_get_backend_name. See
include/rifft.h for the full surface area.
Python bindings
- Built with Maturin + PyO3 (
pythonfeature). rifft.bridgeexposesfft2,ifft2,fft_filter_ifft, plus batched variants.rifftprovides a NumPy-first API that canonicalizes dtype/layout before calling the same backend.- CLI:
python -m rifft.bench --sizes 256 512 1024 --iters 25 --device cpu. - Higher-level helpers live in
rifft.helpersfor in-place Torch usage. rifft.runtimeexposespreplan,enable_timing,timing_reset, andtiming_summaryso you can warm caches and inspect kernel timings from Python.
import torch
from rifft.helpers import fft2_inplace, fft_filter_ifft_inplace
image = torch.randn(256, 256, dtype=torch.complex64)
fft2_inplace(image) # mutates `image`
kernel = torch.randn(256, 256, dtype=torch.complex64)
fft_filter_ifft_inplace(image, kernel) # runs forward→filter→inverse in place
NumPy usage
import numpy as np
from rifft import (
canonicalize_numpy,
rifft,
rifft_ifft,
rifft_out,
preplan,
enable_timing,
timing_reset,
timing_summary,
)
rng = np.random.default_rng(0)
x = (rng.standard_normal((2, 256, 256)) + 1j * rng.standard_normal((2, 256, 256))).astype(
np.complex64
)
canon = canonicalize_numpy(x) # zero-copy when already complex64 + C-contiguous
freq = rifft(canon)
spatial = rifft_ifft(freq.copy()) # normalized inverse (divides by H*W)
detached = rifft_out(x) # leaves `x` untouched and returns a new buffer
preplan([(256, 256), (512, 512)]) # warm planner
enable_timing(True)
timing_reset()
_ = rifft(np.copy(x))
print(timing_summary())
enable_timing(False)
canonicalize_numpy upgrades dtype to complex64 and enforces C-contiguity with at most one copy,
so rifft() can pass the buffer to the Rust kernel without branching on frameworks. Use
rifft_ifft(freq, normalize=False) if you want to skip the automatic 1/(H*W) scaling.
Performance debugging
from rifft import preplan, enable_timing, timing_reset, timing_summary, rifft
import numpy as np
preplan([(256, 256)]) # warm the planner cache for common shapes
enable_timing(True) # start collecting row/col/transpose timings
timing_reset()
data = (np.random.default_rng(0).standard_normal((256, 256)) + 1j * np.random.default_rng(1).standard_normal((256, 256))).astype(np.complex64)
rifft(data) # mutates in place
print(timing_summary()) # {'row_ms': ..., 'transpose_ms': ..., 'calls': ...}
enable_timing(False)
Use this pattern when you benchmark or pipeline repeated transforms: warm the planner once, run a few iterations, and inspect timing_summary() to see if time is spent in the row kernels, column kernels, or the transpose steps. The same helpers back the Python benchmark harness, so the numbers you print locally align with the values in results/rifft_benchmark.json.
Zero-copy DLPack
rifft.bridge converts torch.Tensor objects into DLPack capsules via torch.utils.dlpack. The
PyO3 shim unwraps the capsule, validates contiguity/alignment, and hands the raw pointer to the Rust
planner without copying. Ownership is transferred back to PyTorch after the transform so the
returned tensor shares memory with the accelerated path. See python/examples/torch_fft_demo.py
for a runnable snippet.
Build & test matrix
cargo build --release
maturin develop --features python
pytest python/tests -q
Benchmarks:
cargo bench --bench bench_fft256
cargo bench --bench bench_fft512
cargo bench --bench bench_fft1024
cargo bench --bench bench_fused
# End-to-end harness (builds the extension + runs comparisons)
./scripts/build_and_bench.sh
RIFFT_BENCH_INSTALL_TORCH=1 ./scripts/build_and_bench.sh
# Or run the Python benchmark directly:
python scripts/bench_rifft_compare.py
# include NumPy timings for every shape (skipping >512 by default):
python scripts/bench_rifft_compare.py --numpy-all
The benchmark script reports four implementations per shape: torch.fft, numpy, rifft.torch (PyTorch+DLPack path), rifft.numpy (mutates the NumPy buffer in place), and rifft.numpy_out (uses rifft_out, so it includes the single copy overhead). If PyTorch is unavailable (or you pass --skip-torch), the torch rows are omitted automatically. Pair these numbers with the optional timing summary to understand whether row kernels, column kernels, or transposes dominate a given workload.
PyTorch is only required for torch comparisons; install it via pip install "rifft[bench]" or set RIFFT_BENCH_INSTALL_TORCH=1 when using ./scripts/build_and_bench.sh.
Release helper
Use ./scripts/release.sh to format/lint/test the Rust crate, ensure maturin is ready, and then
optionally publish to crates.io and PyPI (each step prompts for confirmation).
Roadmap
- CUDA + Metal back-ends via opaque
TensorHandles. - Mixed-precision kernels (bf16/half) with on-the-fly promotion.
- Autotuned fuse graph builder for multi-filter workloads.
License
Licensed under either of
- Apache License, Version 2.0, (
LICENSE-APACHEor http://www.apache.org/licenses/LICENSE-2.0) - MIT license (
LICENSE-MITor http://opensource.org/licenses/MIT)
at your option. Any contributions are accepted under the same dual license.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rifft-0.1.2.tar.gz.
File metadata
- Download URL: rifft-0.1.2.tar.gz
- Upload date:
- Size: 55.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5adaefa996e3055aa4e60f828b8e70b65199b22e7907ea7326b4278f5e7ecd64
|
|
| MD5 |
17e3f2b8eab9aff7fb3f7102a79367e2
|
|
| BLAKE2b-256 |
62c0c1476fa2288455999408c98e6ba65b96fd64c7d9358cea3123d0e2a9c3d5
|
Provenance
The following attestation bundles were made for rifft-0.1.2.tar.gz:
Publisher:
publish-python.yml on benschneider/rifft
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rifft-0.1.2.tar.gz -
Subject digest:
5adaefa996e3055aa4e60f828b8e70b65199b22e7907ea7326b4278f5e7ecd64 - Sigstore transparency entry: 855598057
- Sigstore integration time:
-
Permalink:
benschneider/rifft@0a55c69cebba361256cb7fd082eb13deab4f1bd5 -
Branch / Tag:
refs/tags/v0.1.2 - Owner: https://github.com/benschneider
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-python.yml@0a55c69cebba361256cb7fd082eb13deab4f1bd5 -
Trigger Event:
release
-
Statement type:
File details
Details for the file rifft-0.1.2-cp39-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: rifft-0.1.2-cp39-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 801.7 kB
- Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bcf54ed1dafad89a761120bf18a1839507d80afdf4b340825cb88c76027ce898
|
|
| MD5 |
b676773ef99cc548fc25f230a682b355
|
|
| BLAKE2b-256 |
f787bd039b2897298d3e937b776833d6d302870e03796182e8316d3cecf77d86
|
Provenance
The following attestation bundles were made for rifft-0.1.2-cp39-abi3-manylinux_2_34_x86_64.whl:
Publisher:
publish-python.yml on benschneider/rifft
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rifft-0.1.2-cp39-abi3-manylinux_2_34_x86_64.whl -
Subject digest:
bcf54ed1dafad89a761120bf18a1839507d80afdf4b340825cb88c76027ce898 - Sigstore transparency entry: 855598114
- Sigstore integration time:
-
Permalink:
benschneider/rifft@0a55c69cebba361256cb7fd082eb13deab4f1bd5 -
Branch / Tag:
refs/tags/v0.1.2 - Owner: https://github.com/benschneider
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-python.yml@0a55c69cebba361256cb7fd082eb13deab4f1bd5 -
Trigger Event:
release
-
Statement type: