Pure-Rust GPU compute substrate with Python bindings. cuda-oxide-compiled Stockham FFT.

ferrum-gpu

Pure-Rust GPU compute substrate with Python bindings. FFT kernels run on NVIDIA GPUs today via cuda-oxide (Rust source compiled to PTX, no CUDA C). Cross-vendor support via spirv-oxide → Vulkan is on the v0.2 roadmap.

This is v0.1.0. The workspace ships:

  • ferrum-gpu-core: Backend trait, KernelArtifact, errors. no_std + alloc.
  • ferrum-gpu-cuda: impl Backend for Cuda over cudarc 0.19.
  • ferrum-gpu: facade with Device<B> and Buffer<T, B>.
  • ferrum-gpu-fft: 1D + 2D radix-2 power-of-2 C2C FFT host scaffolding + CPU Stockham reference.
  • ferrum-gpu-py: Python bindings via PyO3 + maturin. Exposes a persistent ferrum_gpu.cuda.Device(0) handle plus ferrum_gpu.fft.fft_1d_c2c_pow2 and ferrum_gpu.fft.fft_2d_c2c_pow2.
  • ferrum-gpu-bench: cuFFT comparison binary (1D, batched).
  • examples/vector-add: end-to-end demo using hand-written PTX through the substrate.
  • examples/vector-add-cuda-oxide: same kernel in Rust, compiled to PTX by cuda-oxide.
  • examples/fft-1d-c2c: 1D Stockham FFT in Rust, GPU-vs-CPU on 8 cases (N from 4 to 4096, batched, forward + inverse).

29 GPU pytest cases (16 for 1D, 13 for 2D) are verified end-to-end against numpy.fft.fft / numpy.fft.fft2, with relative-error tolerances ranging from 1e-4 to 1e-3.

Requirements

  • Linux x86_64
  • CUDA Toolkit 13.x
  • NVIDIA driver compatible with the installed Toolkit
  • Rust nightly 2026-04-03 (pinned via rust-toolchain.toml)
  • cargo-oxide: cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
  • For the Python bindings: Python 3.10+ with maturin + numpy + pytest

Quick start: vector-add via hand-written PTX

git clone https://github.com/alejandro-soto-franco/ferrum-gpu
cd ferrum-gpu
make example-vector-add

Expected:

vector_add: 1048576 elements verified

Quick start: vector-add via Rust source + cuda-oxide

cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
cargo oxide doctor       # one-time codegen-backend bootstrap
make example-vector-add-oxide

Expected:

vector_add (cuda-oxide): 1048576 elements verified

Quick start: 1D Stockham FFT

make example-fft

Runs 8 cases (N=4 through N=4096, batched, forward + inverse), each verified against a CPU Stockham reference within 1e-4 relative error.
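
For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of a radix-2 Stockham autosort FFT of the kind a CPU reference could use. It is not the ferrum-gpu-fft source: the function name stockham_fft, the forward sign convention, and the 1/N scaling on the inverse (both borrowed from NumPy) are assumptions made for illustration.

```python
import cmath


def stockham_fft(x, inverse=False):
    """Radix-2 Stockham autosort FFT for power-of-two lengths (illustrative sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError(f"length {n} is not a power of two")

    # Assumption: forward uses exp(-2*pi*i*k*n/N), as numpy.fft.fft does.
    sign = 1.0 if inverse else -1.0

    src = [complex(v) for v in x]
    dst = [0j] * n
    length = n   # size of the sub-transforms handled in this pass
    stride = 1   # number of interleaved sub-transforms

    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(sign * 2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        # Ping-pong the buffers; the output ends up in natural order,
        # so no separate bit-reversal pass is needed.
        src, dst = dst, src
        length = half
        stride *= 2

    if inverse:
        # Assumption: inverse is scaled by 1/N, matching numpy.fft.ifft.
        src = [v / n for v in src]
    return src
```

Under these assumptions, stockham_fft([1, 2, 3, 4]) returns [10, -2+2j, -2, -2-2j] up to rounding, and applying it again with inverse=True recovers the input.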

Quick start: Python

uv is the recommended Python package manager; the Makefile targets and the wheel install path work the same with pip for users who prefer it.

uv venv ~/.venvs/ferrum-gpu
source ~/.venvs/ferrum-gpu/bin/activate
uv pip install maturin pytest numpy
make develop                       # builds the cdylib + installs into the venv
python3 -c "
import numpy as np, ferrum_gpu as fg
arr = np.array([1+0j, 2+0j, 3+0j, 4+0j], dtype=np.complex64)
print(fg.fft.fft_1d_c2c_pow2(arr, log_n=2))
"

Pip equivalent:

python3 -m venv ~/.venvs/ferrum-gpu
source ~/.venvs/ferrum-gpu/bin/activate
pip install maturin pytest numpy
make develop
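
The one-liner in the uv quick start passes log_n=2 for a 4-element array, which suggests that log_n is the base-2 logarithm of the transform length. That reading is an assumption; the README does not spell it out. Under it, a small helper like the following could derive the argument and reject non-power-of-two lengths before anything reaches the GPU. The name log2_pow2 is hypothetical and not part of ferrum_gpu.

```python
def log2_pow2(n: int) -> int:
    """Return log2(n) for a positive power of two, else raise ValueError.

    Hypothetical helper, not part of ferrum_gpu.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError(f"length {n} is not a power of two")
    # For a power of two, the bit length is exactly log2(n) + 1.
    return n.bit_length() - 1
```

With it, the call from the quick start would read fg.fft.fft_1d_c2c_pow2(arr, log_n=log2_pow2(len(arr))). If the binding follows the numpy.fft.fft convention, as the pytest comparison suggests, the input [1, 2, 3, 4] should transform to approximately [10, -2+2j, -2, -2-2j].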

Run the pytest matrix:

make pytest

29 cases (16 1D + 13 2D), each compared against numpy.fft with a relative-error tolerance between 1e-4 and 1e-3.
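
The README does not say which relative-error metric the test suite uses. As one common choice, here is a sketch of a relative L2 error between a computed spectrum and a reference; the function name relative_l2_error and the choice of norm are assumptions, not the project's actual check.

```python
import math


def relative_l2_error(got, ref):
    """||got - ref||_2 / ||ref||_2 for two equal-length complex sequences.

    Illustrative metric only; the project's tests may use a different norm.
    """
    if len(got) != len(ref):
        raise ValueError("sequences must have the same length")
    num = math.sqrt(sum(abs(g - r) ** 2 for g, r in zip(got, ref)))
    den = math.sqrt(sum(abs(r) ** 2 for r in ref))
    # Fall back to the absolute error when the reference is all zeros.
    return num / den if den else num
```

A test would then assert something like relative_l2_error(gpu_out, numpy_out) < 1e-4, with the threshold loosened for larger N where complex64 rounding accumulates.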

Performance

make bench runs ferrum-gpu-bench, which times the in-tree cuda-oxide-compiled Stockham radix-2 power-of-2 C2C kernel against cuFFT (via cudarc 0.19's cufft feature) on batched 1D transforms: N in {256, 1024, 4096}, batch = 256, 100 timed trials per size after a 10-trial warmup. Results are in microseconds per batch, measured on an RTX 5060 Laptop (sm_120):

N      ferrum_us   cufft_us   ratio
256    0.089       0.016      5.52
1024   0.162       0.059      2.72
4096   0.548       0.080      6.86

The Stockham kernel is a single-block-per-FFT reference implementation with no radix-4 or warp-specialised stages, so cuFFT's vendor-tuned plan wins outright at these sizes. Closing the gap is on the v0.2 roadmap.
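
The measurement scheme described above (a warmup phase followed by timed trials) can be sketched host-side as follows. The real benchmark is the Rust ferrum-gpu-bench binary, so this is only an illustration; time_per_call_us and slowdown_ratio are hypothetical names. The ratio column appears to be ferrum_us divided by cufft_us, which is what slowdown_ratio computes.

```python
import time


def time_per_call_us(fn, trials=100, warmup=10):
    """Average wall-clock microseconds per call of fn after a warmup phase.

    For GPU work, fn must block until the device has finished (for example
    by synchronizing the stream); otherwise only launch overhead is timed.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as module loading
    start = time.perf_counter()
    for _ in range(trials):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / trials * 1e6


def slowdown_ratio(candidate_us, baseline_us):
    """How many times slower the candidate is than the baseline."""
    return candidate_us / baseline_us
```

For example, slowdown_ratio(0.548, 0.080) is about 6.85, in line with the N = 4096 row (the published ratio was presumably computed from unrounded timings).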

Testing

CPU-only tests: make test.

GPU tests + all examples + pytest (requires CUDA + NVIDIA GPU): make verify-all.

Publishing (PyPI wheel)

The public wheel is built inside a manylinux_2_28_x86_64 Docker image that ships CUDA Toolkit 13.x, the cuda-oxide-pinned Rust nightly, and maturin. The container is ~6-8 GB and takes ~15-25 minutes to build the first time.

make wheel-manylinux        # builds dist/ferrum_gpu-*-manylinux_2_28_x86_64.whl
auditwheel show dist/*.whl  # verify the manylinux tag

Publishing to PyPI is operator-driven (no CI):

# TestPyPI first
twine upload --repository testpypi dist/*.whl

# PyPI (requires a token in ~/.pypirc)
twine upload dist/*.whl

The local-build path (make develop + make wheel) produces a wheel tagged linux_x86_64 (not manylinux), so it is useful for local testing only.

License

Apache-2.0.

Built Distribution

ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl (360.6 kB view details)

Tags: CPython 3.10+, manylinux (glibc 2.34+), x86-64

Hashes for ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl:

Algorithm     Hash digest
SHA256        3e55e80f7dccadf26ce6d1fa0565a5165590ba0d7185d285b1c2cefa50377ce5
MD5           7002c72f19bd754a82fe662e34f2f584
BLAKE2b-256   6d07e320ccf59d145569f6c9fb354d700b2e8eb0ac4def7e7cf3f6099d6ad776
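
To check a downloaded wheel against the SHA256 digest listed above, a short script using Python's standard hashlib module is enough. This is a sketch: the helper name sha256_of_file is hypothetical, and the path passed to it is wherever the wheel was saved locally.

```python
import hashlib

# SHA256 digest published for the 0.1.0 wheel (copied from the listing above).
EXPECTED_SHA256 = "3e55e80f7dccadf26ce6d1fa0565a5165590ba0d7185d285b1c2cefa50377ce5"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The wheel is intact if sha256_of_file("ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl") == EXPECTED_SHA256 evaluates to True.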

Provenance

The following attestation bundles were made for ferrum_gpu-0.1.0-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on alejandro-soto-franco/ferrum-gpu
